Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 26 Feb 2013, hagai wrote:

> for what is worth.. I had the same problem and found the answer here -
> http://forums.freebsd.org/showthread.php?t=27207

Given enough sequential I/O requests, zfs mirrors behave very much like RAID-0 for reads. Sequential prefetch is very important in order to avoid the latencies.

While this script may not work perfectly as is for FreeBSD, it was very good at discovering a zfs performance bug (since corrected) and is still an interesting exercise to see how ZFS ARC caching helps for re-reads. See "http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh". The script will exercise an initial uncached read from disks, and then a (hopefully) cached re-read from disks. I think that it serves as a useful benchmark.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sata mirror slower than single disk
Be careful when testing ZFS with iozone. I ran a bunch of stats many years ago that produced results that did not pass a basic sanity check. There was *something* about the iozone test data that ZFS either did not like or liked very much, depending on the specific test. I eventually wrote my own very crude tool to test exactly what our workload was and started getting results that matched the reality we saw.

On Jul 17, 2012, at 4:18 PM, Bob Friesenhahn wrote:

> On Tue, 17 Jul 2012, Michael Hase wrote:
>
>> To work around these caching effects just use a file > 2 times the size of
>> ram, iostat then shows the numbers really coming from disk. I always test
>> like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but
>> quite impressive ;-)
>
> Ok, the iozone benchmark finally completed. The results do suggest that
> reading from mirrors substantially improves the throughput. This is
> interesting since the results differ (better than) from my 'virgin mount'
> test approach:
>
> Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 8G -g 256G
>
>        KB  reclen    write  rewrite     read   reread
>   8388608      64   572933  1008668  6945355  7509762
>   8388608     128  2753805  2388803  6482464  7041942
>   8388608     256  2508358  2331419  2969764  3045430
>   8388608     512  2407497  2131829  3021579  3086763
>  16777216      64   671365   879080  6323844  6608806
>  16777216     128  1279401  2286287  6409733  6739226
>  16777216     256  2382223  2211097  2957624  3021704
>  16777216     512  2237742  2179611  3048039  3085978
>  33554432      64   933712   699966  6418428  6604694
>  33554432     128   459896   431640  6443848  6546043
>  33554432     256       90   430989  2997615  3026246
>  33554432     512   427158   430891  3042620  3100287
>  67108864      64   426720   427167  6628750  6738623
>  67108864     128   419328   422581      153  6743711
>  67108864     256   419441   419129  3044352  3056615
>  67108864     512   431053   417203  3090652  3112296
> 134217728      64   417668    55434   759351   760994
> 134217728     128   409383   400433   759161   765120
> 134217728     256   408193   405868   763892   766184
> 134217728     512   408114   403473   761683   766615
> 268435456      64   418910    55239   768042   768498
> 268435456     128   408990   399732   763279   766882
> 268435456     256   413919   399386   760800   764468
> 268435456     512   410246   403019   766627   768739
>
> Bob

--
Paul Kraus
Deputy Technical Director, LoneStarCon 3
Sound Coordinator, Schenectady Light Opera Company
Re: [zfs-discuss] zfs sata mirror slower than single disk
for what is worth.. I had the same problem and found the answer here - http://forums.freebsd.org/showthread.php?t=27207
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote:

> To work around these caching effects just use a file > 2 times the size of
> ram, iostat then shows the numbers really coming from disk. I always test
> like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but
> quite impressive ;-)

Ok, the iozone benchmark finally completed. The results do suggest that reading from mirrors substantially improves the throughput. This is interesting since the results differ (better than) from my 'virgin mount' test approach:

Command line used: iozone -a -i 0 -i 1 -y 64 -q 512 -n 8G -g 256G

       KB  reclen    write  rewrite     read   reread
  8388608      64   572933  1008668  6945355  7509762
  8388608     128  2753805  2388803  6482464  7041942
  8388608     256  2508358  2331419  2969764  3045430
  8388608     512  2407497  2131829  3021579  3086763
 16777216      64   671365   879080  6323844  6608806
 16777216     128  1279401  2286287  6409733  6739226
 16777216     256  2382223  2211097  2957624  3021704
 16777216     512  2237742  2179611  3048039  3085978
 33554432      64   933712   699966  6418428  6604694
 33554432     128   459896   431640  6443848  6546043
 33554432     256       90   430989  2997615  3026246
 33554432     512   427158   430891  3042620  3100287
 67108864      64   426720   427167  6628750  6738623
 67108864     128   419328   422581      153  6743711
 67108864     256   419441   419129  3044352  3056615
 67108864     512   431053   417203  3090652  3112296
134217728      64   417668    55434   759351   760994
134217728     128   409383   400433   759161   765120
134217728     256   408193   405868   763892   766184
134217728     512   408114   403473   761683   766615
268435456      64   418910    55239   768042   768498
268435456     128   408990   399732   763279   766882
268435456     256   413919   399386   760800   764468
268435456     512   410246   403019   766627   768739

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
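One way to read the iozone table: any read rate above what the disks can physically deliver must be coming at least partly from the ARC. The following is a minimal sketch of that check, not part of the original benchmark. It assumes iozone's default output unit of KB/s and takes the 1200 MB/s raw pool bandwidth (8 disks at 150 MB/s) mentioned elsewhere in this thread as the ceiling. The function name and the three sample rows picked from the table are for illustration only.

```python
# Sketch: flag iozone read results that exceed the raw bandwidth of the pool.
# Assumptions (not from the iozone output itself): rates are in KB/s
# (1 KB = 1024 bytes), and the pool ceiling is 8 disks x 150 MB/s = 1200 MB/s.

RAW_POOL_MB_S = 8 * 150  # raw sequential bandwidth quoted in the thread


def served_from_cache(read_kb_s, ceiling_mb_s=RAW_POOL_MB_S):
    """Return True if the read rate is higher than the disks could deliver."""
    read_mb_s = read_kb_s * 1024 / 1e6
    return read_mb_s > ceiling_mb_s


# (file size in KB, record length in KB, read rate in KB/s) from the table
samples = [
    (8388608, 64, 6945355),
    (67108864, 512, 3090652),
    (134217728, 64, 759351),
]

for size_kb, reclen, read_kb_s in samples:
    size_gb = size_kb // (1024 * 1024)
    source = "cache" if served_from_cache(read_kb_s) else "disk"
    print(f"{size_gb:4d} GB file, reclen {reclen:3d}: {read_kb_s} KB/s -> {source}")
```

By this rough test the 8 GB to 64 GB runs are cache-assisted, and only the 128 GB and 256 GB runs reflect disk speed.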
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote:

>> The below is with a 2.6 GB test file but with a 26 GB test file (just add
>> another zero to 'count' and wait longer) I see an initial read rate of
>> 618 MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.
>
> To work around these caching effects just use a file > 2 times the size of
> ram, iostat then shows the numbers really coming from disk. I always test
> like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but
> quite impressive ;-)

Yes, in the past I have done benchmarking with file size 2X the size of memory. This does not necessarily erase all caching because the ARC is smart enough not to toss everything.

At the moment I have an iozone benchmark running from 8 GB up to 256 GB file size. I see that it has started the 256 GB size now. It may be a while. Maybe a day.

> In the range of > 600 MB/s other issues may show up (pcie bus contention,
> hba contention, cpu load). And performance at this level could be just good
> enough, not requiring any further tuning. Could you recheck with only 4
> disks (2 mirror pairs)? If you just get some 350 MB/s it could be the same
> problem as with my boxes. All sata disks?

Unfortunately, I already put my pool into use and can not conveniently destroy it now. The disks I am using are SAS (7200 RPM, 1 GB) but return similar per-disk data rates as the SATA disks I use for the boot pool.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
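The "file larger than 2x RAM" rule of thumb is easy to turn into a dd block count. This is a small sketch only. The helper name is made up, and the 16 GiB RAM figure is just an example value that has to be replaced with the real memory size of the machine under test.

```python
# Sketch: how many dd blocks are needed for a test file of N times RAM.
# The factor of 2 is the rule of thumb from the thread; ram_bytes must be
# supplied by the caller (16 GiB below is only an example value).

def dd_count(ram_bytes, block_bytes=128 * 1024, factor=2):
    """Number of blocks so that count * block_bytes >= factor * ram_bytes."""
    target = factor * ram_bytes
    return -(-target // block_bytes)  # ceiling division


ram = 16 * 1024**3
count = dd_count(ram)
print(f"dd if=/dev/urandom of=random.dat bs=128k count={count}")
```

As noted above, even a file of twice the memory size does not necessarily defeat the ARC completely, so iostat should still be watched during the read.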
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Bob Friesenhahn wrote:

> On Tue, 17 Jul 2012, Michael Hase wrote:
>
>>> If you were to add a second vdev (i.e. stripe) then you should see very
>>> close to 200% due to the default round-robin scheduling of the writes.
>>
>> My expectation would be > 200%, as 4 disks are involved. It may not be the
>> perfect 4x scaling, but imho it should be (and is for a scsi system) more
>> than half of the theoretical throughput. This is solaris or a solaris
>> derivative, not linux ;-)
>
> Here are some results from my own machine based on the 'virgin mount' test
> approach. The results show less boost than is reported by a benchmark tool
> like 'iozone' which sees benefits from caching.
>
> I get an initial sequential read speed of 657 MB/s on my new pool which
> has 1200 MB/s of raw bandwidth (if mirrors could produce 100% boost).
> Reading the file a second time reports 6.9 GB/s.
>
> The below is with a 2.6 GB test file but with a 26 GB test file (just add
> another zero to 'count' and wait longer) I see an initial read rate of
> 618 MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.

To work around these caching effects just use a file > 2 times the size of ram, iostat then shows the numbers really coming from disk. I always test like this. a re-read rate of 8.2 GB/s is really just memory bandwidth, but quite impressive ;-)

> % pfexec zfs create tank/zfstest/defaults
> % cd /tank/zfstest/defaults
> % pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
> 20000+0 records in
> 20000+0 records out
> 2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
> % cd ..
> % pfexec zfs umount tank/zfstest/defaults
> % pfexec zfs mount tank/zfstest/defaults
> % cd defaults
> % dd if=random.dat of=/dev/null bs=128k count=20000
> 20000+0 records in
> 20000+0 records out
> 2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
> % pfexec dd if=/dev/rdsk/c7t5393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
> 2000+0 records in
> 2000+0 records out
> 262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
> % bc
> scale=8
> 657/150
> 4.38000000
>
> It is very difficult to benchmark with a cache which works so well:
>
> % dd if=random.dat of=/dev/null bs=128k count=20000
> 20000+0 records in
> 20000+0 records out
> 2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s

This is not my point, I'm pretty sure I did not measure any arc effects - maybe with the one exception of the raid0 test on the scsi array. Don't know why the arc had this effect, filesize was 2x of ram. The point is: I'm searching for an explanation for the relative slowness of a mirror pair of sata disks, or some tuning knobs, or something like "the disks are plain crap", or maybe: zfs throttles sata disks in general (don't know the internals).

In the range of > 600 MB/s other issues may show up (pcie bus contention, hba contention, cpu load). And performance at this level could be just good enough, not requiring any further tuning. Could you recheck with only 4 disks (2 mirror pairs)? If you just get some 350 MB/s it could be the same problem as with my boxes. All sata disks?

Michael
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote:

>> If you were to add a second vdev (i.e. stripe) then you should see very
>> close to 200% due to the default round-robin scheduling of the writes.
>
> My expectation would be > 200%, as 4 disks are involved. It may not be the
> perfect 4x scaling, but imho it should be (and is for a scsi system) more
> than half of the theoretical throughput. This is solaris or a solaris
> derivative, not linux ;-)

Here are some results from my own machine based on the 'virgin mount' test approach. The results show less boost than is reported by a benchmark tool like 'iozone' which sees benefits from caching.

I get an initial sequential read speed of 657 MB/s on my new pool which has 1200 MB/s of raw bandwidth (if mirrors could produce 100% boost). Reading the file a second time reports 6.9 GB/s.

The below is with a 2.6 GB test file but with a 26 GB test file (just add another zero to 'count' and wait longer) I see an initial read rate of 618 MB/s and a re-read rate of 8.2 GB/s. The raw disk can transfer 150 MB/s.

% zpool status
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 0h10m with 0 errors on Mon Jul 16 04:30:48 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      ONLINE       0     0     0
          mirror-0                ONLINE       0     0     0
            c7t5393E8CA21FAd0p0   ONLINE       0     0     0
            c11t5393D8CA34B2d0p0  ONLINE       0     0     0
          mirror-1                ONLINE       0     0     0
            c8t5393E8CA2066d0p0   ONLINE       0     0     0
            c12t5393E8CA2196d0p0  ONLINE       0     0     0
          mirror-2                ONLINE       0     0     0
            c9t5393D8CA82A2d0p0   ONLINE       0     0     0
            c13t5393E8CA2116d0p0  ONLINE       0     0     0
          mirror-3                ONLINE       0     0     0
            c10t5393D8CA59C2d0p0  ONLINE       0     0     0
            c14t5393D8CA828Ed0p0  ONLINE       0     0     0

errors: No known data errors
% pfexec zfs create tank/zfstest
% pfexec zfs create tank/zfstest/defaults
% cd /tank/zfstest/defaults
% pfexec dd if=/dev/urandom of=random.dat bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 36.8133 s, 71.2 MB/s
% cd ..
% pfexec zfs umount tank/zfstest/defaults
% pfexec zfs mount tank/zfstest/defaults
% cd defaults
% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 3.99229 s, 657 MB/s
% pfexec dd if=/dev/rdsk/c7t5393E8CA21FAd0p0 of=/dev/null bs=128k count=2000
2000+0 records in
2000+0 records out
262144000 bytes (262 MB) copied, 1.74532 s, 150 MB/s
% bc
scale=8
657/150
4.38000000

It is very difficult to benchmark with a cache which works so well:

% dd if=random.dat of=/dev/null bs=128k count=20000
20000+0 records in
20000+0 records out
2621440000 bytes (2.6 GB) copied, 0.379147 s, 6.9 GB/s

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
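The rates in the dd transcript can be recomputed from the byte counts and elapsed times. This is a small sketch, assuming the 2.6 GB file consists of 20000 blocks of 128 KiB and that dd reports decimal megabytes (1 MB = 10^6 bytes), which is what its printed figures suggest. The variable names are illustrative, and the 8-disk count is read off the zpool status output (4 mirrors of 2 disks).

```python
# Sketch: recompute the rates dd printed and the scaling over one raw disk.
# Byte counts and times are taken from the transcript; the 8-disk count
# comes from the zpool status output (4 mirrors of 2 disks).

def mb_per_s(nbytes, seconds):
    """Throughput in MB/s, assuming 1 MB = 10**6 bytes."""
    return nbytes / seconds / 1e6


pool_read = mb_per_s(20000 * 128 * 1024, 3.99229)   # first read after remount
raw_disk = mb_per_s(2000 * 128 * 1024, 1.74532)     # one disk via /dev/rdsk
disks = 8

print(f"pool read : {pool_read:.0f} MB/s")
print(f"raw disk  : {raw_disk:.0f} MB/s")
print(f"scaling   : {pool_read / raw_disk:.2f}x of one disk")
print(f"efficiency: {pool_read / (disks * raw_disk):.0%} of {disks} disks")
```

The scaling figure comes out slightly below the 4.38 from bc because bc was fed the already rounded 657 and 150.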
Re: [zfs-discuss] zfs sata mirror slower than single disk
sorry to insist, but still no real answer...

On Mon, 16 Jul 2012, Bob Friesenhahn wrote:

> On Tue, 17 Jul 2012, Michael Hase wrote:
>
>> So only one thing left: mirror should read 2x
>
> I don't think that mirror should necessarily read 2x faster even though the
> potential is there to do so. Last I heard, zfs did not include a special
> read scheduler for sequential reads from a mirrored pair. As a result, 50%
> of the time, a read will be scheduled for a device which already has a read
> scheduled. If this is indeed true, the typical performance would be 150%.
> There may be some other scheduling factor (e.g. estimate of busyness) which
> might still allow zfs to select the right side and do better than that.
>
> If you were to add a second vdev (i.e. stripe) then you should see very
> close to 200% due to the default round-robin scheduling of the writes.

My expectation would be > 200%, as 4 disks are involved. It may not be the perfect 4x scaling, but imho it should be (and is for a scsi system) more than half of the theoretical throughput. This is solaris or a solaris derivative, not linux ;-)

> It is really difficult to measure zfs read performance due to caching
> effects. One way to do it is to write a large file (containing random data
> such as returned from /dev/urandom) to a zfs filesystem, unmount the
> filesystem, remount the filesystem, and then time how long it takes to read
> the file once. The reason why this works is because remounting the
> filesystem restarts the filesystem cache.

Ok, did a zpool export/import cycle between the dd read and write test. This really empties the arc, checked this with arc_summary.pl. The test even uses two processes in parallel (doesn't make a difference). Result is still the same:

dd write: 2x 58 MB/sec --> perfect, each disk does > 110 MB/sec
dd read:  2x 68 MB/sec --> imho too slow, about 68 MB/sec per disk

For writes each disk gets 900 128k io requests/sec with asvc_t in the 8-9 msec range. For reads each disk only gets 500 io requests/sec, asvc_t 18-20 msec with the default zfs_vdev_max_pending=10. When reducing zfs_vdev_max_pending the asvc_t drops accordingly, the i/o rate remains at 500/sec per disk, throughput also the same. I think iostat values should be reliable here. These high iops numbers make sense as we work on empty pools so there aren't very high seek times.

All benchmarks (dd, bonnie, will try iozone) lead to the same result: on the sata mirror pair read performance is in the range of a single disk. For the sas disks (only two available for testing) and for the scsi system there is quite good throughput scaling.

Here for comparison a table for 1-4 36gb 15k u320 scsi disks on an old sxde box (nevada b130):

            seq write  factor  seq read  factor
            MB/sec             MB/sec
single       82        1        78       1
mirror       79        1       137       1.75
2x mirror   120        1.5     251       3.2

This is exactly what's imho to be expected from mirrors and striped mirrors. It just doesn't happen for my sata pool. Still have no reference numbers for other sata pools, just one with the 4k/512bytes sector problem which is even slower than mine. It seems the zfs performance people just use sas disks and be done.
Michael

old ibm dual opteron intellistation with external hp msa30, 36gb 15k u320 scsi disks

  pool: scsi1
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        scsi1     ONLINE       0     0     0
          c3t4d0  ONLINE       0     0     0

errors: No known data errors

Version 1.96        --Sequential Output--         --Sequential Input-  --Random-
Concurrency 1       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--  --Seeks--
Machine     Size    K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP   /sec %CP
zfssingle    16G      137  99 82739  20 39453   9   314  99 78251   7  856.9   8
Latency             160ms     4799ms    5292ms    43210us   3274ms    2069ms
Version 1.96        --Sequential Create--         Random Create
zfssingle           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
            files    /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               16    8819  34     + +++ 26318  68 20390  73     + +++ 26846  72
Latency             16413us   108us     231us     12206us   46us      124us
1.96,1.96,zfssingle,1,1342514790,16G,,137,99,82739,20,39453,9,314,99,78251,7,856.9,8,16,8819,34,+,+++,26318,68,20390,73,+,+++,26846,72,160ms,4799ms,5292ms,43210us,3274ms,2069ms,16413us,108us,231us,12206us,46us,124us

##

  pool: scsi1
 state: ONLINE
 scrub: none requested
config:

        NAME      STATE     READ WRITE CKSUM
        scsi1     ONLINE       0
Re: [zfs-discuss] zfs sata mirror slower than single disk
> From: Michael Hase [mailto:mich...@edition-software.de]
> Sent: Monday, July 16, 2012 6:41 PM
>
> So only one thing left: mirror should read 2x

That is still weird - But all your numbers so far are coming from bonnie. Why don't you do a test like this? (below)

Write a big file to mirror. Reboot (or something) to clear cache. Now time reading the file. Sometimes you'll get a different result with dd versus cat.

> Could someone please send me some bonnie++ results for a 2 disk mirror or
> a 2x2 disk mirror pool with sata disks?

I don't have bonnie, but I have certainly confirmed mirror performance on solaris before with sata disks. I've generally done iozone, benchmarking the N-way mirror, and the stripe-of-mirrors. So I know the expectation in this case is correct.
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Tue, 17 Jul 2012, Michael Hase wrote:

> So only one thing left: mirror should read 2x

I don't think that mirror should necessarily read 2x faster even though the potential is there to do so. Last I heard, zfs did not include a special read scheduler for sequential reads from a mirrored pair. As a result, 50% of the time, a read will be scheduled for a device which already has a read scheduled. If this is indeed true, the typical performance would be 150%. There may be some other scheduling factor (e.g. estimate of busyness) which might still allow zfs to select the right side and do better than that.

If you were to add a second vdev (i.e. stripe) then you should see very close to 200% due to the default round-robin scheduling of the writes.

It is really difficult to measure zfs read performance due to caching effects. One way to do it is to write a large file (containing random data such as returned from /dev/urandom) to a zfs filesystem, unmount the filesystem, remount the filesystem, and then time how long it takes to read the file once. This works because remounting the filesystem restarts the filesystem cache.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
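The 150% estimate can be reproduced with a deliberately naive model. This sketch is an assumption for illustration, not a description of how ZFS actually picks a mirror side. It supposes two equal-sized reads are outstanding at a time, each lands on a random side of the mirror, and a disk serves its own reads one after another. The function name is made up.

```python
# Sketch: naive model of mirror read scheduling without a sequential-aware
# scheduler. Each outstanding read independently picks one side of the
# mirror at random. This is an illustration, not how ZFS actually chooses.
from itertools import product


def expected_speedup(disks=2, reads=2):
    """Average speedup over one disk when `reads` equal-sized reads are
    spread at random over `disks` and each disk serves its queue serially."""
    speedups = []
    for assignment in product(range(disks), repeat=reads):
        busiest = max(assignment.count(d) for d in range(disks))
        speedups.append(reads / busiest)
    return sum(speedups) / len(speedups)


print(expected_speedup(2, 2))
```

For two disks and two reads, half of the assignments collide on one disk (speedup 1) and half use both disks (speedup 2), which averages to the 1.5 quoted above. A scheduler that tracks how busy each side is could do better than this random choice.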
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Michael Hase
>>
>> got some strange results, please see
>> attachments for exact numbers and pool config:
>>
>>           seq write  factor  seq read  factor
>>           MB/sec             MB/sec
>> single    123        1       135       1
>> raid0     114        1       249       2
>> mirror     57        0.5     129       1
>
> I agree with you these look wrong. Here is what you should expect:
>
>           seq W   seq R
> single    1.0     1.0
> stripe    2.0     2.0
> mirror    1.0     2.0
>
> You have three things wrong:
> (a) stripe should write 2x
> (b) mirror should write 1x
> (c) mirror should read 2x
>
> I would have simply said "for some reason your drives are unable to operate
> concurrently" but you have the stripe read 2x.
>
> I cannot think of a single reason that the stripe should be able to read
> 2x, and the mirror only 1x.

Yes, I think so too. In the meantime I switched the two disks to another box (hp xw8400, 2 xeon 5150 cpus, 16gb ram). On this machine I did the previous sas tests. OS is now OpenIndiana 151a (vs OpenSolaris b130 before), the mirror pool was upgraded from version 22 to 28, the raid0 pool newly created. The results look quite different:

          seq write  factor  seq read  factor
          MB/sec             MB/sec
raid0     236        2       330       2.5
mirror    111        1       128       1

Now the raid0 case shows excellent performance, the 330 MB/sec are a bit on the optimistic side, maybe some arc cache effects (file size 32gb, 16gb ram). iostat during sequential read shows about 115 MB/sec from each disk, which is great.

The (really desired) mirror case still has a problem with sequential reads. Sequential writes to the mirror are twice as fast as before, and show the expected performance for a single disk.

So only one thing left: mirror should read 2x

I suspect the difference is not the hardware, both boxes should have enough horsepower to easily do sequential reads with way more than 200 MB/sec. In all tests cpu time (user and system) remained quite low. I think it's an OS issue: OpenSolaris b130 is over 2 years old, OI 151a dates 11/2011.

Could someone please send me some bonnie++ results for a 2 disk mirror or a 2x2 disk mirror pool with sata disks?

Michael
--
Michael Hase
http://edition-software.de
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Bob Friesenhahn wrote:

> On Mon, 16 Jul 2012, Michael Hase wrote:
>
>> This is my understanding of zfs: it should load balance read requests even
>> for a single sequential reader. zfs_prefetch_disable is the default 0. And
>> I can see exactly this scaling behaviour with sas disks and with scsi
>> disks, just not on this sata pool.
>
> Is the BIOS configured to use AHCI mode or is it using IDE mode?

Not relevant here, disks are connected to an onboard sas hba (lsi 1068, see first post), hardware is a primergy rx330 with 2 qc opterons.

> Are the disks 512 byte/sector or 4K?

512 byte/sector, HDS721010CLA330

>> Maybe it's a corner case which doesn't matter in real world applications?
>> The random seek values in my bonnie output show the expected performance
>> boost when going from one disk to a mirrored configuration. It's just the
>> sequential read/write case, that's different for sata and sas disks.
>
> I don't have a whole lot of experience with SATA disks but it is my
> impression that you might see this sort of performance if the BIOS was
> configured so that the drives were used as IDE disks. If not that, then
> there must be a bottleneck in your hardware somewhere.

With early nevada releases I had indeed the IDE/AHCI problem, albeit on different hardware. Solaris only ran in IDE mode, disks were 4 times slower than on linux, see http://www.oracle.com/webfolder/technetwork/hcl/data/components/details/intel/sol_10_05_08/2999.html

Wouldn't a hardware bottleneck show up on raw dd tests as well? I can stream > 130 MB/sec from each of the two disks in parallel. dd reading from more than these two disks at the same time results in a slight slowdown, but here we talk about nearly 400 MB/sec aggregated bandwidth through the onboard hba, the box has 6 disk slots:

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.5   0 100 c13t6d0
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.6   0 100 c13t1d0
   93.0    0.0   93.0    0.0  0.0  1.0    0.0   10.7   0 100 c13t2d0
   94.5    0.0   94.5    0.0  0.0  1.0    0.0   10.5   0 100 c13t5d0

Don't know why this is a bit slower, maybe some pci-e bottleneck. Or something with the mpt driver, intrstat shows only one cpu handles all mpt interrupts. Or even the slow cpus? These are 1.8ghz opterons.

During sequential reads from the zfs mirror I see > 1000 interrupts/sec on one cpu. So it could really be a bottleneck somewhere triggered by the "smallish" 128k i/o requests from the zfs side. I think I'll benchmark again on a xeon box with faster cpus, my tests with sas disks were done on this other box.

Michael
Re: [zfs-discuss] zfs sata mirror slower than single disk
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Michael Hase
>
> got some strange results, please see
> attachments for exact numbers and pool config:
>
>           seq write  factor  seq read  factor
>           MB/sec             MB/sec
> single    123        1       135       1
> raid0     114        1       249       2
> mirror     57        0.5     129       1

I agree with you these look wrong. Here is what you should expect:

          seq W   seq R
single    1.0     1.0
stripe    2.0     2.0
mirror    1.0     2.0

You have three things wrong:
(a) stripe should write 2x
(b) mirror should write 1x
(c) mirror should read 2x

I would have simply said "for some reason your drives are unable to operate concurrently" but you have the stripe read 2x.

I cannot think of a single reason that the stripe should be able to read 2x, and the mirror only 1x.
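The comparison of expected and measured factors can be written down as a short check. This is a sketch only. The measured values are the SATA numbers from the quoted table (single 123/135, raid0 114/249, mirror 57/129 MB/sec write/read), and the 25% tolerance and all names are arbitrary choices for the illustration.

```python
# Sketch: compare measured scaling factors with the expected ones.
# Expected factors are the ones listed in the message above; measured
# MB/s values are the SATA numbers quoted there. The 25% tolerance is an
# arbitrary choice for this illustration.

EXPECTED = {"stripe": (2.0, 2.0), "mirror": (1.0, 2.0)}  # (write, read)
single = (123, 135)                                       # MB/s write, read
measured = {"stripe": (114, 249), "mirror": (57, 129)}


def off_target(layout, tolerance=0.25):
    """Return the operations whose factor misses the expectation."""
    problems = []
    for name, base, got, want in zip(("write", "read"), single,
                                     measured[layout], EXPECTED[layout]):
        factor = got / base
        if abs(factor - want) > tolerance * want:
            problems.append((name, round(factor, 2), want))
    return problems


for layout in measured:
    print(layout, off_target(layout))
```

With these inputs the check flags the stripe write, the mirror write and the mirror read, which are the three items (a), (b) and (c).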
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Michael Hase wrote:

> This is my understanding of zfs: it should load balance read requests even
> for a single sequential reader. zfs_prefetch_disable is the default 0. And
> I can see exactly this scaling behaviour with sas disks and with scsi
> disks, just not on this sata pool.

Is the BIOS configured to use AHCI mode or is it using IDE mode?

Are the disks 512 byte/sector or 4K?

> Maybe it's a corner case which doesn't matter in real world applications?
> The random seek values in my bonnie output show the expected performance
> boost when going from one disk to a mirrored configuration. It's just the
> sequential read/write case, that's different for sata and sas disks.

I don't have a whole lot of experience with SATA disks but it is my impression that you might see this sort of performance if the BIOS was configured so that the drives were used as IDE disks. If not that, then there must be a bottleneck in your hardware somewhere.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Bob Friesenhahn wrote:

> On Mon, 16 Jul 2012, Stefan Ring wrote:
>
>>> It is normal for reads from mirrors to be faster than for a single disk
>>> because reads can be scheduled from either disk, with different I/Os
>>> being handled in parallel.
>>
>> That assumes that there *are* outstanding requests to be scheduled in
>> parallel, which would only happen with multiple readers or a large
>> read-ahead buffer.
>
> That is true. Zfs tries to detect the case of sequential reads and requests
> to read more data than the application has already requested. In this case
> the data may be prefetched from the other disk before the application has
> requested it.

This is my understanding of zfs: it should load balance read requests even for a single sequential reader. zfs_prefetch_disable is the default 0. And I can see exactly this scaling behaviour with sas disks and with scsi disks, just not on this sata pool.

zfs_vdev_max_pending is already tuned down to 3 as recommended for sata disks, iostat -Mxnz 2 looks something like

    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  507.1    0.0   63.4    0.0  0.0  2.9    0.0    5.8   1  99 c13t5d0
  477.6    0.0   59.7    0.0  0.0  2.8    0.0    5.8   1  94 c13t4d0

when reading from the zfs mirror. The default zfs_vdev_max_pending=10 leads to much higher service times in the 20-30msec range, throughput remains roughly the same.

I can read from the dsk or rdsk devices in parallel with real platter speeds:

dd if=/dev/dsk/c13t4d0s0 of=/dev/null bs=1024k count=8192 &
dd if=/dev/dsk/c13t5d0s0 of=/dev/null bs=1024k count=8192 &

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 2467.5    0.0  134.9    0.0  0.0  0.9    0.0    0.4   1  87 c13t5d0
 2546.5    0.0  139.3    0.0  0.0  0.8    0.0    0.3   1  84 c13t4d0

So I think there is no problem with the disks.

Maybe it's a corner case which doesn't matter in real world applications? The random seek values in my bonnie output show the expected performance boost when going from one disk to a mirrored configuration. It's just the sequential read/write case, that's different for sata and sas disks.

Michael
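The iostat rows can be cross-checked with Little's law (average queue depth = request rate x service time), which also gives the average request size. This is a rough sketch with made-up names. The values are the c13t5d0 rows from the mirror read (507.1 r/s, 63.4 Mr/s, actv 2.9, asvc_t 5.8 ms) and from the raw dd read (2467.5 r/s, 134.9 Mr/s, actv 0.9, asvc_t 0.4 ms), and treating Mr/s as MiB/s is an assumption.

```python
# Sketch: sanity-check iostat rows with Little's law (queue = rate * time)
# and derive the average request size. Values are copied from the two
# iostat samples above; Mr/s is treated as MiB/s, which is an assumption.

def check(reads_per_s, mib_per_s, asvc_ms):
    """Return (average request size in KiB, queue depth predicted by
    Little's law) for one iostat row."""
    avg_kib = mib_per_s * 1024 / reads_per_s
    predicted_queue = reads_per_s * asvc_ms / 1000
    return avg_kib, predicted_queue


# label -> (r/s, Mr/s, asvc_t in ms, reported actv)
rows = {
    "zfs mirror read (c13t5d0)": (507.1, 63.4, 5.8, 2.9),
    "raw dd read (c13t5d0)": (2467.5, 134.9, 0.4, 0.9),
}

for label, (rps, mibs, asvc, actv) in rows.items():
    kib, queue = check(rps, mibs, asvc)
    print(f"{label}: {kib:.0f} KiB/request, "
          f"predicted queue {queue:.1f} vs reported actv {actv}")
```

The mirror read works out to about 128 KiB per request with a queue close to the configured limit of 3. If the queue stays pinned at the zfs_vdev_max_pending limit while the request rate is fixed, a larger limit would only raise the service time, which fits the unchanged throughput reported above.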
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Stefan Ring wrote:

>> It is normal for reads from mirrors to be faster than for a single disk
>> because reads can be scheduled from either disk, with different I/Os being
>> handled in parallel.
>
> That assumes that there *are* outstanding requests to be scheduled in
> parallel, which would only happen with multiple readers or a large
> read-ahead buffer.

That is true. Zfs tries to detect the case of sequential reads and requests to read more data than the application has already requested. In this case the data may be prefetched from the other disk before the application has requested it.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
> It is normal for reads from mirrors to be faster than for a single disk
> because reads can be scheduled from either disk, with different I/Os being
> handled in parallel.

That assumes that there *are* outstanding requests to be scheduled in parallel, which would only happen with multiple readers or a large read-ahead buffer.
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Mon, 16 Jul 2012, Stefan Ring wrote:

> I wouldn't expect mirrored read to be faster than single-disk read,
> because the individual disks would need to read small chunks of data with
> holes in-between. Regardless of the holes being read or not, the disk will
> spin at the same speed.

It is normal for reads from mirrors to be faster than for a single disk
because reads can be scheduled from either disk, with different I/Os being
handled in parallel.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs sata mirror slower than single disk
> 2) in the mirror case the write speed is cut by half, and the read
> speed is the same as a single disk. I'd expect about twice the
> performance for both reading and writing, maybe a bit less, but
> definitely more than measured.

I wouldn't expect mirrored read to be faster than single-disk read, because
the individual disks would need to read small chunks of data with holes
in-between. Regardless of the holes being read or not, the disk will spin at
the same speed.
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Jul 16, 2012, at 2:43 AM, Michael Hase wrote:

> Hello list,
>
> did some bonnie++ benchmarks for different zpool configurations
> consisting of one or two 1tb sata disks (hitachi hds721010cla332, 512
> bytes/sector, 7.2k), and got some strange results, please see
> attachments for exact numbers and pool config:
>
>           seq write  factor   seq read  factor
>           MB/sec              MB/sec
> single    123        1        135       1
> raid0     114        1        249       2
> mirror     57        0.5      129       1
>
> Each of the disks is capable of about 135 MB/sec sequential reads and
> about 120 MB/sec sequential writes, iostat -En shows no defects. Disks
> are 100% busy in all tests, and show normal service times.

For 7,200 rpm disks, average service times should be on the order of 10ms
writes and 13ms reads. If you see averages > 20ms, then you are likely
running into scheduling issues.
 -- richard

> This is on
> opensolaris 130b, rebooting with openindiana 151a live cd gives the
> same results, dd tests give the same results, too. Storage controller
> is an lsi 1068 using mpt driver. The pools are newly created and
> empty. atime on/off doesn't make a difference.
>
> Is there an explanation why
>
> 1) in the raid0 case the write speed is more or less the same as a
> single disk.
>
> 2) in the mirror case the write speed is cut by half, and the read
> speed is the same as a single disk. I'd expect about twice the
> performance for both reading and writing, maybe a bit less, but
> definitely more than measured.
>
> For comparison I did the same tests with 2 old 2.5" 36gb sas 10k disks
> maxing out at about 50-60 MB/sec on the outer tracks.
>
>           seq write  factor   seq read  factor
>           MB/sec              MB/sec
> single     38        1         50       1
> raid0      89        2        111       2
> mirror     36        1         92       2
>
> Here we get the expected behaviour: raid0 with about double the
> performance for reading and writing, mirror about the same performance
> for writing, and double the speed for reading, compared to a single
> disk. An old scsi system with 4x2 mirror pairs also shows these
> scaling characteristics, about 450-500 MB/sec seq read and 250 MB/sec
> write, each disk capable of 80 MB/sec. I don't care about absolute
> numbers, just don't get why the sata system is so much slower than
> expected, especially for a simple mirror. Any ideas?
>
> Thanks,
> Michael
>
> --
> Michael Hase
> http://edition-software.de

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
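For context on the service-time rule of thumb above, here is a back-of-the-envelope sketch (not from the thread) of where a figure near 13 ms for random reads can come from: average rotational latency is half a revolution, and an average seek is added on top. The 8.5 ms seek time is an assumed, typical-looking value for a 7,200 rpm drive, not a number given in the thread or taken from the Hitachi data sheet; the function names are hypothetical.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def random_read_service_ms(rpm, avg_seek_ms):
    """Rough random-read service time: average seek plus rotational latency."""
    return avg_seek_ms + rotational_latency_ms(rpm)


ASSUMED_AVG_SEEK_MS = 8.5  # assumption for illustration only

print(f"7200 rpm rotational latency: {rotational_latency_ms(7200):.2f} ms")
print(
    "7200 rpm random read estimate: "
    f"{random_read_service_ms(7200, ASSUMED_AVG_SEEK_MS):.1f} ms"
)
```

Under that assumption the estimate lands a little under 13 ms. Sequential reads avoid most of the seek and rotational delay, so this estimate applies to random I/O rather than to the sequential bonnie++ numbers.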
[zfs-discuss] zfs sata mirror slower than single disk
Hello list,

did some bonnie++ benchmarks for different zpool configurations consisting
of one or two 1tb sata disks (hitachi hds721010cla332, 512 bytes/sector,
7.2k), and got some strange results, please see attachments for exact
numbers and pool config:

          seq write  factor   seq read  factor
          MB/sec              MB/sec
single    123        1        135       1
raid0     114        1        249       2
mirror     57        0.5      129       1

Each of the disks is capable of about 135 MB/sec sequential reads and about
120 MB/sec sequential writes, iostat -En shows no defects. Disks are 100%
busy in all tests, and show normal service times. This is on opensolaris
130b, rebooting with openindiana 151a live cd gives the same results, dd
tests give the same results, too. Storage controller is an lsi 1068 using
mpt driver. The pools are newly created and empty. atime on/off doesn't make
a difference.

Is there an explanation why

1) in the raid0 case the write speed is more or less the same as a single
disk.

2) in the mirror case the write speed is cut by half, and the read speed is
the same as a single disk. I'd expect about twice the performance for both
reading and writing, maybe a bit less, but definitely more than measured.

For comparison I did the same tests with 2 old 2.5" 36gb sas 10k disks
maxing out at about 50-60 MB/sec on the outer tracks.

          seq write  factor   seq read  factor
          MB/sec              MB/sec
single     38        1         50       1
raid0      89        2        111       2
mirror     36        1         92       2

Here we get the expected behaviour: raid0 with about double the performance
for reading and writing, mirror about the same performance for writing, and
double the speed for reading, compared to a single disk. An old scsi system
with 4x2 mirror pairs also shows these scaling characteristics, about
450-500 MB/sec seq read and 250 MB/sec write, each disk capable of 80
MB/sec. I don't care about absolute numbers, just don't get why the sata
system is so much slower than expected, especially for a simple mirror. Any
ideas?
Thanks,
Michael

--
Michael Hase
http://edition-software.de

  pool: ptest
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ptest       ONLINE       0     0     0
          c13t4d0   ONLINE       0     0     0

Version 1.96        --Sequential Output-- --Sequential Input- --Random-
Concurrency 1       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfssingle       32G    79  98 123866 51 63626  35   255  99 135359 25 530.6  13
Latency                 333ms     111ms    5283ms   73791us     465ms    2535ms
Version 1.96        --Sequential Create-- Random Create
zfssingle           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4536  40     + +++ 14140  50 10382  69     + +++  6260  73
Latency               21655us     154us     206us   24539us      46us     405us
1.96,1.96,zfssingle,1,1342165334,32G,,79,98,123866,51,63626,35,255,99,135359,25,530.6,13,16,4536,40,+,+++,14140,50,10382,69,+,+++,6260,73,333ms,111ms,5283ms,73791us,465ms,2535ms,21655us,154us,206us,24539us,46us,405us

###

  pool: ptest
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ptest       ONLINE       0     0     0
          c13t4d0   ONLINE       0     0     0
          c13t5d0   ONLINE       0     0     0

Version 1.96        --Sequential Output-- --Sequential Input- --Random-
Concurrency 1       -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
zfsstripe       32G    78  98 114243 46 72938  37   192  77 249022 44 815.1  20
Latency                 483ms     106ms    5179ms    3613ms     259ms    1567ms
Version 1.96        --Sequential Create-- Random Create
zfsstripe           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6474  53     + +++ 15505  47  8562  81     + +++ 10839  65
Latency               21894us     131us     208us   22203us      52us     230us
1.96,1.96,zfsstripe,1,1342172768,32G,,78,98,114243,46,72938,37,192,77,249022,44,815.1,20,16,6474,53,+,+++,15505,47,8562,81,+,+++,10839,65,483ms,106ms,5179ms,3613ms,259ms,1567ms,21894us,131us,208us,22203us,52us,230us

  pool: ptest
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        ptest        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            c13t4d0  ONLINE       0     0     0