Re: Poor read performance on high-end server

2010-08-09 Thread Freek Dijkstra
Hi all,

Thanks a lot for the great feedback from before the weekend. Since one
of my colleagues needed the machine, I could only do the tests today.

In short: just installing 2.6.35 did make some difference, but I was
mostly impressed with the speedup gained by the hardware acceleration of
the crc32c_intel module.

Here is some quick data.

Reference figures:
16 * single disk (theoretical limit):     4092 MiByte/s
fio data layer tests (achievable limit):  3250 MiByte/s
ZFS performance:                          2505 MiByte/s

BtrFS figures:
IOzone on 2.6.32:                          919 MiByte/s
fio btrfs tests on 2.6.35:                1460 MiByte/s
IOzone on 2.6.35 with crc32c:             1250 MiByte/s
IOzone on 2.6.35 with crc32c_intel:       1629 MiByte/s
IOzone on 2.6.35, using -o nodatasum:     1955 MiByte/s

For those who find this message and want a howto: the easiest way to use
crc32c_intel is to add the module name to /etc/modules:
 # echo crc32c_intel >> /etc/modules
 # reboot

The next step for us is to tune the block sizes. We have only done that
in a preliminary way, but now that we have a good idea of which software
to use, we can start tuning in more detail.
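
As a rough sketch, such a block-size sweep could be a small fio job file
along the following lines (the directory, file size and block sizes are
only placeholders; the stonewall option makes each block size run on its
own rather than concurrently):

 [global]
 ioengine=libaio
 direct=1
 iodepth=8
 rw=read
 size=8g
 directory=/mnt/mybtrfsdisk

 [bs-1m]
 bs=1m
 stonewall
 [bs-4m]
 bs=4m
 stonewall
 [bs-20m]
 bs=20m
 stonewall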

If there is interest on this list, I'll gladly post our results here.


Jens Axboe wrote:

 Also, I didn't see Chris mention this, but if you have a newer intel box
 you can use hw accelerated crc32c instead. For some reason my test box
 always loads crc32c and not crc32c-intel, so I need to do that manually.
 
 it is pretty annoying to have to do it manually. Sometimes
 you forget. And it's not possible to de-select CRC32C and have
 the intel variant loaded.

You can, but only if you first unmount the partition:

 # umount /mnt/mybtrfsdisk
 # rmmod btrfs
 # rmmod libcrc32c
 # rmmod crc32c
 # modprobe crc32c_intel
 # mount -t btrfs /dev/sda1 /mnt/mybtrfsdisk
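
To double-check which implementation btrfs ended up with, something like
this works (with the Intel variant active, /proc/crypto should list
crc32c-intel as the driver for the crc32c algorithm):

 # lsmod | grep crc32c
 # grep -A 2 crc32c /proc/crypto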




We encountered a small bug: the RAID0 btrfs partition that was created
on 2.6.32 no longer mounted after a reboot or after unmounting. Running
btrfsck fixes this, but after the next umount we had to run btrfsck
again. After recreating the btrfs partition on 2.6.35, all was well.
btrfs partitions that don't use (software) RAID work fine.

~# mount -t btrfs -o ssd /dev/sdd /mnt/ssd3
mount: wrong fs type, bad option, bad superblock on /dev/sdd,
   missing codepage or helper program, or other error
   In some cases useful info is found in syslog - try
   dmesg | tail  or so

~# dmesg | tail
device fsid ec4d518ec61d4496-81e5aeda2d8ef7b5 devid 1 transid 69 /dev/sdd
btrfs: use ssd allocation scheme
btrfs: failed to read the system array on sdd
btrfs: open_ctree failed

~# btrfsck /dev/sdd
found 550511136768 bytes used err is 0
total csum bytes: 536870912
total tree bytes: 755322880
total fs tree bytes: 77824
btree space waste bytes: 169152328
file data blocks allocated: 549755813888
 referenced 549755813888
Btrfs Btrfs v0.19

~# mount -t btrfs -o ssd /dev/sdd /mnt/ssd3
[and it mounts fine now]


Regards,
Freek Dijkstra
SARA High Performance Computing and Networking


Re: Poor read performance on high-end server

2010-08-08 Thread Andi Kleen
Jens Axboe ax...@kernel.dk writes:

 Also, I didn't see Chris mention this, but if you have a newer intel box
 you can use hw accelerated crc32c instead. For some reason my test box
 always loads crc32c and not crc32c-intel, so I need to do that manually.

I have a patch for that, will post it later: autoloading of modules
based on x86 cpuinfo.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Poor read performance on high-end server

2010-08-08 Thread Jens Axboe
On 08/08/2010 03:18 AM, Andi Kleen wrote:
 Jens Axboe ax...@kernel.dk writes:

 Also, I didn't see Chris mention this, but if you have a newer intel box
 you can use hw accelerated crc32c instead. For some reason my test box
 always loads crc32c and not crc32c-intel, so I need to do that manually.
 
 I have a patch for that, will post it later: autoloading of modules
 based on x86 cpuinfo.

Great, it is pretty annoying to have to do it manually. Sometimes
you forget. And it's not possible to de-select CRC32C and have
the intel variant loaded.

-- 
Jens Axboe



Re: Poor read performance on high-end server

2010-08-06 Thread Chris Mason
On Thu, Aug 05, 2010 at 11:21:06PM +0200, Freek Dijkstra wrote:
 Chris Mason wrote:
 
  Basically we have two different things to tune.  First the block layer
  and then btrfs.
 
 
  And then we need to setup a fio job file that hammers on all the ssds at
  once.  I'd have it use adio/dio and talk directly to the drives.
 
 Thanks. First one disk:
 
  f1: (groupid=0, jobs=1): err= 0: pid=6273
read : io=32780MB, bw=260964KB/s, iops=12, runt=128626msec
  clat (usec): min=74940, max=80721, avg=78449.61, stdev=923.24
  bw (KB/s) : min=240469, max=269981, per=100.10%, avg=261214.77, 
  stdev=2765.91
cpu  : usr=0.01%, sys=2.69%, ctx=1747, majf=0, minf=5153
IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
  >=64=0.0%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
  >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
  >=64=0.0%
   issued r/w: total=1639/0, short=0/0
  
   lat (msec): 100=100.00%
  
  Run status group 0 (all jobs):
 READ: io=32780MB, aggrb=260963KB/s, minb=267226KB/s, maxb=267226KB/s, 
  mint=128626msec, maxt=128626msec
  
  Disk stats (read/write):
sdd: ios=261901/0, merge=0/0, ticks=10135270/0, in_queue=10136460, 
  util=99.30%
 
 So 255 MiByte/s.
 Out of curiosity, what is the distinction between the reported figures
 of 260964 kiB/s, 261214.77 kiB/s, 267226 kiB/s and 260963 kiB/s?

When there is only one job, they should all be the same.  aggr is the
total seen across all the jobs, min is the lowest, max is the highest.

 
 
 Now 16 disks (abbreviated):
 
  ~/fio# ./fio ssd.fio
  Starting 16 processes
  f1: (groupid=0, jobs=1): err= 0: pid=4756
read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
  clat (msec): min=75, max=138, avg=96.15, stdev= 4.47
   lat (msec): min=75, max=138, avg=96.15, stdev= 4.47
  bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, 
  stdev=9052.26
cpu  : usr=0.00%, sys=1.71%, ctx=2737, majf=0, minf=5153
IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
  >=64=0.0%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
  >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
  >=64=0.0%
   issued r/w: total=1639/0, short=0/0
  
   lat (msec): 100=97.99%, 250=2.01%
  Run status group 0 (all jobs):
 READ: io=524480MB, aggrb=3301MB/s, minb=216323KB/s, maxb=219763KB/s, 
  mint=156406msec, maxt=158893msec

 So, the maximum for these 16 disks is 3301 MiByte/s.
 
 I also tried hardware RAID (2 sets of 8 disks), and got a similar result:
 
  Run status group 0 (all jobs):
 READ: io=65560MB, aggrb=3024MB/s, minb=1548MB/s, maxb=1550MB/s, 
  mint=21650msec, maxt=21681msec

Great, so we know the drives are fast.

 
 
 
  fio should be able to push these devices up to the line speed.  If it
  doesn't I would suggest changing elevators (deadline, cfq, noop) and
  bumping the max request size to the max supported by the device.
 
 3301 MiByte/s seems like a reasonable number, given the theoretic
 maximum of 16 times the single disk performance of 16*256 MiByte/s =
 4096 MiByte/s.
 
 Based on this, I have not looked at tuning. Would you recommend that I do?
 
 Our minimal goal is 2500 MiByte/s; that seems achievable as ZFS was able
 to reach 2750 MiByte/s without tuning.
 
  When we have a config that does so, we can tune the btrfs side of things
  as well.
 
 Some files are created in the root folder of the mount point, but I get
 errors instead of results:
 

Someone else mentioned that btrfs only gained DIO reads in 2.6.35.  I
think you'll get the best results with that kernel if you can find an
update.

If not, you can change the fio job file to remove direct=1 and increase the
bs flag up to 20M.
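
Roughly, the buffered variant would then look like this, with the per-job
sections pointing at files on the mounted btrfs volume instead of the raw
devices (the paths below are only placeholders):

 [global]
 size=32g
 bs=20m
 rw=read

 [f1]
 filename=/mnt/btrfs/file1
 [f2]
 filename=/mnt/btrfs/file2
 [...]
 [f16]
 filename=/mnt/btrfs/file16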

I'd also suggest changing /sys/class/bdi/btrfs-1/read_ahead_kb to a
bigger number.  Try 20480
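
For example (the btrfs-1 name can differ per system; ls /sys/class/bdi/
shows the available ones):

 # echo 20480 > /sys/class/bdi/btrfs-1/read_ahead_kb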

-chris


Re: Poor read performance on high-end server

2010-08-06 Thread Jens Axboe
On 2010-08-05 16:51, Chris Mason wrote:
 And then we need to setup a fio job file that hammers on all the ssds at
 once.  I'd have it use adio/dio and talk directly to the drives.  I'd do
 something like this for the fio job file, but Jens Axboe is cc'd and he
 might make another suggestion on the job file.  I'd do something like
 this in a file named ssd.fio
 
 [global]
 size=32g
 direct=1
 iodepth=8

iodepth=8 will have no effect if you don't also set a different IO
engine, otherwise you would be using read(2) to fetch the data. So add
ioengine=libaio to take advantage of a higher queue depth as well.
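
With that change, the global section of ssd.fio would read something like:

 [global]
 size=32g
 direct=1
 iodepth=8
 ioengine=libaio
 bs=20m
 rw=read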

Also, I didn't see Chris mention this, but if you have a newer intel box
you can use hw accelerated crc32c instead. For some reason my test box
always loads crc32c and not crc32c-intel, so I need to do that manually.
That helps a lot with higher transfer rates. You can check support for
hw crc32c by checking for the 'sse4_2' flag in /proc/cpuinfo.
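
For example, this prints sse4_2 once if (and only if) the CPU supports it:

 # grep -m1 -o sse4_2 /proc/cpuinfo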

-- 
Jens Axboe



Re: Poor read performance on high-end server

2010-08-06 Thread Chris Mason
On Fri, Aug 06, 2010 at 01:55:21PM +0200, Jens Axboe wrote:
 On 2010-08-05 16:51, Chris Mason wrote:
  And then we need to setup a fio job file that hammers on all the ssds at
  once.  I'd have it use adio/dio and talk directly to the drives.  I'd do
  something like this for the fio job file, but Jens Axboe is cc'd and he
  might make another suggestion on the job file.  I'd do something like
  this in a file named ssd.fio
  
  [global]
  size=32g
  direct=1
  iodepth=8
 
 iodepth=8 will have no effect if you don't also set a different IO
 engine, otherwise you would be using read(2) to fetch the data. So add
 ioengine=libaio to take advantage of a higher queue depth as well.

Yeah, I just realized I messed up the suggested file, but it worked well
enough on the block devices, so I think just having 16 procs hitting the
array was enough.  libaio will only help with O_DIRECT though, so this
only applies to 2.6.35 as well.

 
 Also, I didn't see Chris mention this, but if you have a newer intel box
 you can use hw accelerated crc32c instead. For some reason my test box
 always loads crc32c and not crc32c-intel, so I need to do that manually.
 That helps a lot with higher transfer rates. You can check support for
 hw crc32c by checking for the 'sse4_2' flag in /proc/cpuinfo.

Yeah, the HW assisted crc does make a huge difference.

-chris



Re: Poor read performance on high-end server

2010-08-05 Thread Mathieu Chouquet-Stringer
Hello,

freek.dijks...@sara.nl (Freek Dijkstra) writes:
 [...]

 Here are the exact settings:
 ~# mkfs.btrfs -d raid0 /dev/sdd /dev/sde /dev/sdf /dev/sdg \
  /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm \
  /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
 nodesize 4096 leafsize 4096 sectorsize 4096 size 2.33TB
 Btrfs Btrfs v0.19

Don't you need to stripe metadata too (with -m raid0)?  Or you may
be limited by your metadata drive?
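
That would be something like (device list abbreviated):

 # mkfs.btrfs -d raid0 -m raid0 /dev/sdd /dev/sde [...] /dev/sds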

-- 
Mathieu Chouquet-Stringer   mchou...@free.fr
The sun itself sees not till heaven clears.
 -- William Shakespeare --


Re: Poor read performance on high-end server

2010-08-05 Thread Freek Dijkstra
Chris, Daniel and Mathieu,

Thanks for your constructive feedback!

 On Thu, Aug 05, 2010 at 04:05:33PM +0200, Freek Dijkstra wrote:
             ZFS            BtrFS
  1 SSD      256 MiByte/s   256 MiByte/s
  2 SSDs     505 MiByte/s   504 MiByte/s
  3 SSDs     736 MiByte/s   756 MiByte/s
  4 SSDs     952 MiByte/s   916 MiByte/s
  5 SSDs    1226 MiByte/s   986 MiByte/s
  6 SSDs    1450 MiByte/s   978 MiByte/s
  8 SSDs    1653 MiByte/s   932 MiByte/s
 16 SSDs    2750 MiByte/s   919 MiByte/s

[...]
 The above results were for Ubuntu 10.04.1 server, with BtrFS v0.19,
 
 Which kernels are those?

For BtrFS: Linux 2.6.32-21-server #32-Ubuntu SMP x86_64 GNU/Linux
For ZFS: FreeBSD 8.1-RELEASE (GENERIC)

(Note that we currently cannot upgrade easily due to binary drivers for
the SAS+SATA controllers :(. I'd be happy to push the vendor, though, if
you think it makes a difference.)


Daniel J Blueman wrote:

 Perhaps create a new filesystem and mount with 'nodatasum'

I get an improvement: 919 MiByte/s just became 1580 MiByte/s. Not as
fast as it could be, but most certainly an improvement.

 existing extents which were previously created will be checked, so
 need to start fresh.

Indeed, and it also holds the other way around. I created two test files,
one while mounted with and one without the -o nodatasum option:
write w/o nodatasum; read w/o nodatasum:  919 ± 43 MiByte/s
write w/o nodatasum; read w/  nodatasum:  922 ± 72 MiByte/s
write w/  nodatasum; read w/o nodatasum: 1082 ± 46 MiByte/s
write w/  nodatasum; read w/  nodatasum: 1586 ± 126 MiByte/s

So even if I remount the disk in the normal way, and read a file created
without checksums, I still get a small improvement :)
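
For reference, a remount without data checksums is simply something like
this (device and mount point are placeholders):

 # umount /mnt/btrfs
 # mount -t btrfs -o nodatasum /dev/sdd /mnt/btrfs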

(PS: the above tests were repeated 4 times, the last one even 8 times. As
you can see from the standard deviations, the results are not always very
consistent. The cause is unknown; CPU load is low.)


Chris Mason wrote:

 Basically we have two different things to tune.  First the block layer
 and then btrfs.


 And then we need to setup a fio job file that hammers on all the ssds at
 once.  I'd have it use adio/dio and talk directly to the drives.
 
 [global]
 size=32g
 direct=1
 iodepth=8
 bs=20m
 rw=read
 
 [f1]
 filename=/dev/sdd
 [f2]
 filename=/dev/sde
 [f3]
 filename=/dev/sdf
[...]
 [f16]
 filename=/dev/sds

Thanks. First one disk:

 f1: (groupid=0, jobs=1): err= 0: pid=6273
   read : io=32780MB, bw=260964KB/s, iops=12, runt=128626msec
 clat (usec): min=74940, max=80721, avg=78449.61, stdev=923.24
 bw (KB/s) : min=240469, max=269981, per=100.10%, avg=261214.77, 
 stdev=2765.91
   cpu  : usr=0.01%, sys=2.69%, ctx=1747, majf=0, minf=5153
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
  issued r/w: total=1639/0, short=0/0
 
  lat (msec): 100=100.00%
 
 Run status group 0 (all jobs):
READ: io=32780MB, aggrb=260963KB/s, minb=267226KB/s, maxb=267226KB/s, 
 mint=128626msec, maxt=128626msec
 
 Disk stats (read/write):
   sdd: ios=261901/0, merge=0/0, ticks=10135270/0, in_queue=10136460, 
 util=99.30%

So 255 MiByte/s.
Out of curiosity, what is the distinction between the reported figures
of 260964 kiB/s, 261214.77 kiB/s, 267226 kiB/s and 260963 kiB/s?


Now 16 disks (abbreviated):

 ~/fio# ./fio ssd.fio
 Starting 16 processes
 f1: (groupid=0, jobs=1): err= 0: pid=4756
   read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
 clat (msec): min=75, max=138, avg=96.15, stdev= 4.47
  lat (msec): min=75, max=138, avg=96.15, stdev= 4.47
 bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, 
 stdev=9052.26
   cpu  : usr=0.00%, sys=1.71%, ctx=2737, majf=0, minf=5153
   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
 >=64=0.0%
  issued r/w: total=1639/0, short=0/0
 
  lat (msec): 100=97.99%, 250=2.01%

[..similar for f2 to f16..]

 f1:  read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
 bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, 
 stdev=9052.26
 f2:  read : io=32780MB, bw=213873KB/s, iops=10, runt=156947msec
 bw (KB/s) : min=151143, max=251508, per=6.33%, avg=213987.34, 
 stdev=8958.86
 f3:  read : io=32780MB, bw=214613KB/s, iops=10, runt=156406msec
 bw (KB/s) : min=149216, max=219037, per=6.35%, avg=214779.89, 
 stdev=9332.99
 f4:  read : io=32780MB, bw=214388KB/s, iops=10, runt=156570msec
 bw (KB/s) : min=148675, max=226298, per=6.35%, avg=214576.51, 
 stdev=8985.03
 f5:  read : io=32780MB, bw=213848KB/s, iops=10, runt=156965msec
 bw (KB/s) : min=144479, max=241414, per=6.33%, avg=213935.81, 
 stdev=10023.68
 f6:  read : io=32780MB, bw=213514KB/s, iops=10,