Re: [zfs-discuss] Very poor small-block random write performance
2012-07-22 1:24, Bob Friesenhahn wrote:
> On Sat, 21 Jul 2012, Jim Klimov wrote:
>> During this quick test I did not manage to craft a test which would
>> inflate a file in the middle without touching its other blocks (other
>> than using a text editor which saves the whole file - so that is
>> irrelevant), in order to see if ZFS can "insert" smaller blocks in the
>> middle of an existing file, and whether it would reallocate other
>> blocks to fit the set recordsizes.
>
> The POSIX filesystem interface does not support such a thing
> ('insert'). Presumably the underlying zfs pool could support such a
> thing if there was a layer on top to request it. The closest
> equivalent in a POSIX filesystem would be if a previously-null block
> in a sparse file is updated to hold content.

Well then, this concludes the matter: you were right five years ago,
and that still holds up today ;)

//Jim
Re: [zfs-discuss] Very poor small-block random write performance
On Sat, 21 Jul 2012, Jim Klimov wrote:
> During this quick test I did not manage to craft a test which would
> inflate a file in the middle without touching its other blocks (other
> than using a text editor which saves the whole file - so that is
> irrelevant), in order to see if ZFS can "insert" smaller blocks in the
> middle of an existing file, and whether it would reallocate other
> blocks to fit the set recordsizes.

The POSIX filesystem interface does not support such a thing ('insert').
Presumably the underlying zfs pool could support such a thing if there
was a layer on top to request it. The closest equivalent in a POSIX
filesystem would be if a previously-null block in a sparse file is
updated to hold content.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
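To make that last point concrete, here is a minimal sketch using the
same scratch rpool/test dataset as in the zdb test further down the
thread (the file name, offsets, and the <object> placeholder are
arbitrary; substitute the object number reported by ls -i):

  # /usr/gnu/bin/dd if=/dev/urandom of=/rpool/test/sparse bs=1k count=1; sync
  # /usr/gnu/bin/dd if=/dev/urandom of=/rpool/test/sparse bs=1k count=1 \
      seek=1024 conv=notrunc; sync
  # du -k /rpool/test/sparse        # allocated size far below the ~1 MB length
  # ls -i /rpool/test/sparse        # note the object number for zdb
  # zdb -dddddd rpool/test <object> | grep -c ' L0 '   # only 2 data blocks

  # "insert" content the POSIX way: update a previously-null range
  # /usr/gnu/bin/dd if=/dev/urandom of=/rpool/test/sparse bs=1k count=1 \
      seek=512 conv=notrunc; sync
  # zdb -dddddd rpool/test <object> | grep -c ' L0 '   # now 3 data blocks

The two initial writes sit about 1 MB apart, so the records between
them are never written and stay as a hole; filling part of that hole
later makes a new L0 data block appear without touching the others,
which is exactly the "previously-null block updated to hold content"
case described above.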
Re: [zfs-discuss] Very poor small-block random write performance
2012-07-20 5:11, Bob Friesenhahn wrote:
> On Fri, 20 Jul 2012, Jim Klimov wrote:
>>> Zfs data block sizes are fixed size! Only tail blocks are shorter.
>>
>> This is the part I am not sure is either implied by the docs nor
>> confirmed by my practice. But maybe I've missed something...
>
> This is something that I am quite certain of. :-)
>
> When doing a random write inside a file, the unit of COW is the zfs
> filesystem blocksize.

Well, apparently I was wrong, and Bob was right :)

I ran a simple test like this:

# zfs create -o compression=off -o dedup=off -o copies=1 rpool/test

This should rule out complex storage options for user-data bytes.

# cd /rpool/test/ && touch file && ls -lai
total 9
4 drwxr-xr-x 2 root root 3 Jul 21 22:38 .
4 drwxr-xr-x 5 root root 5 Jul 21 22:37 ..
8 -rw-r--r-- 1 root root 0 Jul 21 22:38 file

So the file's inode number is 8 (above). This is used for zdb
inspections (below).

# /usr/gnu/bin/dd if=/dev/random bs=1k count=1 >> file; sync; \
  zdb -dddddd rpool/test 8 | grep ' L. '

This command line was repeated a few times. Apparently (as Bob wrote me
off-list), changes in the tail block cause it to be read from disk
completely, the new bytes appended, and the block written out - up to
the dataset recordsize. Thus all intermediate blocks of a file should
consume full recordsizes, even if the file was appended to in small
portions spread over several TXGs.

Replacing kilobytes at locations spanning one or two blocks also caused
reallocations and rewrites of zfs recordsized pieces:

# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/file bs=1k \
  seek=12 count=10 conv=noerror,notrunc; sync
# /usr/gnu/bin/dd if=/dev/random of=/rpool/test/file bs=1k \
  seek=125 count=10 conv=noerror,notrunc; sync
# zdb -dddddd rpool/test 8 | grep ' L. '
Dataset rpool/test [ZPL], ID 8412, cr_txg 2110110, 289K, 8 objects, rootbp
 DVA[0]=<0:a69b82a00:200> DVA[1]=<0:264111e00:200> [L0 DMU objset] fletcher4
 lzjb LE contiguous unique double size=800L/200P birth=2110309L/2110309P
 fill=8 cksum=1373a7e215:68dacd0d409:12c745c671d52:25e0616461e20c
       0 L1  0:a69b80200:400 0:263f1dc00:400 4000L/400P F=2 B=2110309/2110309
       0  L0 0:a61bcb200:20000 20000L/20000P F=1 B=2110309/2110309
   20000  L0 0:a61b87a00:20000 20000L/20000P F=1 B=2110304/2110304

During this quick test I did not manage to craft a test which would
inflate a file in the middle without touching its other blocks (other
than using a text editor which saves the whole file - so that is
irrelevant), in order to see if ZFS can "insert" smaller blocks in the
middle of an existing file, and whether it would reallocate other
blocks to fit the set recordsizes.

For generic filesystem uses (append, replace 1:1) at least, Bob's
assessment is right - zfs stores recordsized blocks and one possibly
smaller tail block, not a series of random-sized blocks as I implied.

I might imagine situations like heavily congested systems where zfs
might cut corners to get dirty bytes out to disk faster - and not
read-merge-write tail blocks - but even if this is implemented at all,
it should be a rare condition.

//Jim Klimov
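The complementary case - a file that never grows past a single block -
can be checked the same way. A small sketch using the same dataset
(file name arbitrary, <object> again taken from ls -i):

  # /usr/gnu/bin/dd if=/dev/urandom of=/rpool/test/tiny bs=1k count=3; sync
  # ls -i /rpool/test/tiny                      # note the object number
  # zdb -dddddd rpool/test <object> | grep ' L0 '

Here the single L0 line should show a logical/physical size of only a
few kilobytes rather than 20000 (hex for 128K), matching Bob's point
that only the tail block of a file may be shorter than the recordsize,
while every block of a multi-block file is allocated at the full
recordsize as in the output above.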
Re: [zfs-discuss] Very poor small-block random write performance
On 07/19/12 18:24, Traffanstead, Mike wrote:
> iozone doesn't vary the blocksize during the test, it's a very
> artificial test but it's useful for gauging performance under
> different scenarios.
>
> So for this test all of the writes would have been 64k blocks, 128k,
> etc. for that particular step.
>
> Just as another point of reference I reran the test with a Crucial M4
> SSD and the results for 16G/64k were 35MB/s (x5 improvement).
>
> I'll rerun that part of the test with zpool iostat and see what it says.

For random writes to work without forcing a lot of read i/o and
read-modify-write sequences, set the recordsize on the filesystem used
for the test to match the iozone record size. For instance:

  zfs set recordsize=64k $fsname

and ensure that the files used for the test are re-created after you
make this setting change ("recordsize" is sticky at file creation time).
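As a concrete sketch of that sequence (the dataset name tank/iozone and
the file path are only placeholders for whatever the test actually
uses, and the iozone invocation is just one way to rerun only the
16G/64k random-write step):

  # zfs set recordsize=64k tank/iozone
  # zfs get recordsize tank/iozone       # confirm the new value
  # rm /tank/iozone/iozone.tmp           # old files keep their 128k blocks
  # iozone -i 0 -i 2 -s 16g -r 64k -f /tank/iozone/iozone.tmp

Because the recordsize is applied at file creation, deleting (or
copying anew) the iozone test file is what actually makes it pick up
the 64k records; rerunning over the existing file changes nothing.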
Re: [zfs-discuss] Very poor small-block random write performance
iozone doesn't vary the blocksize during the test, it's a very
artificial test but it's useful for gauging performance under different
scenarios.

So for this test all of the writes would have been 64k blocks, 128k,
etc. for that particular step.

Just as another point of reference I reran the test with a Crucial M4
SSD and the results for 16G/64k were 35MB/s (x5 improvement).

I'll rerun that part of the test with zpool iostat and see what it says.

Mike

On Thu, Jul 19, 2012 at 7:27 PM, Jim Klimov wrote:
>> This is normal. The problem is that with zfs 128k block sizes, zfs
>> needs to re-read the original 128k block so that it can compose and
>> write the new 128k block. With sufficient RAM, this is normally
>> avoided because the original block is already cached in the ARC.
>>
>> If you were to reduce the zfs blocksize to 64k then the performance
>> dive at 64k would go away but there would still be write performance
>> loss at sizes other than a multiple of 64k.
>
> I am not sure if I misunderstood the question or Bob's answer,
> but I have a gut feeling it is not fully correct: ZFS block
> sizes for files (filesystem datasets) are, at least by default,
> dynamically-sized depending on the contiguous write size as
> queued by the time a ZFS transaction is closed and flushed to
> disk. In case of RAIDZ layouts, this logical block is further
> striped over several sectors on several disks in one of the
> top-level vdevs, starting with parity sectors for each "row".
>
> So, if the test logically overwrites full blocks of test data
> files, reads for recombination are not needed (but that can
> be checked for with "iostat 1" or "zpool iostat" - to see how
> many reads do happen during write-tests?) Note that some reads
> will show up anyway, i.e. to update ZFS metadata (the block
> pointer tree).
>
> However, if the test file was written in 128K blocks and then
> is rewritten with 64K blocks, then Bob's answer is probably
> valid - the block would have to be re-read once for the first
> rewrite of its half; it might be taken from cache for the
> second half's rewrite (if that comes soon enough), and may be
> spooled to disk as a couple of 64K blocks or one 128K block
> (if both changes come soon after each other - within one TXG).
>
> HTH,
> //Jim Klimov
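For that rerun, a minimal sketch of what to watch (the pool name tank
is a placeholder): run the monitor in a second terminal while the
16G/64k random-write step is going.

  # zpool iostat -v tank 1     # per-second pool and per-vdev statistics
  # iostat -x 1                # per-device view of the same activity

If the read-operations columns for the data vdevs stay busy during the
pure write phase, the 64k writes are forcing read-modify-write of 128k
records; if reads stay near zero, the rewrites are either record-aligned
or being satisfied from the ARC, with only occasional metadata reads
showing up.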
Re: [zfs-discuss] Very poor small-block random write performance
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5

On Thu, Jul 19, 2012 at 8:47 PM, John Martin wrote:
> On 07/19/12 19:27, Jim Klimov wrote:
>
>> However, if the test file was written in 128K blocks and then
>> is rewritten with 64K blocks, then Bob's answer is probably
>> valid - the block would have to be re-read once for the first
>> rewrite of its half; it might be taken from cache for the
>> second half's rewrite (if that comes soon enough), and may be
>> spooled to disk as a couple of 64K blocks or one 128K block
>> (if both changes come soon after each other - within one TXG).
>
> What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
> on this system (FreeBSD, IIRC)?
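Those two sysctls are the FreeBSD names for the TXG sync tunables John
asked about. A minimal sketch of inspecting them and experimenting with
a longer TXG interval (the value 10 is purely illustrative):

  # sysctl vfs.zfs.txg.timeout vfs.zfs.txg.synctime_ms
  # sysctl vfs.zfs.txg.timeout=10      # try a longer TXG interval at runtime

If the sysctl turns out to be read-only on a given release, the same
name can be set in /boot/loader.conf instead and takes effect at the
next boot.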
Re: [zfs-discuss] Very poor small-block random write performance
On 07/19/12 19:27, Jim Klimov wrote:
> However, if the test file was written in 128K blocks and then
> is rewritten with 64K blocks, then Bob's answer is probably
> valid - the block would have to be re-read once for the first
> rewrite of its half; it might be taken from cache for the
> second half's rewrite (if that comes soon enough), and may be
> spooled to disk as a couple of 64K blocks or one 128K block
> (if both changes come soon after each other - within one TXG).

What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
on this system (FreeBSD, IIRC)?
Re: [zfs-discuss] Very poor small-block random write performance
On Fri, 20 Jul 2012, Jim Klimov wrote:
> I am not sure if I misunderstood the question or Bob's answer,
> but I have a gut feeling it is not fully correct: ZFS block
> sizes for files (filesystem datasets) are, at least by default,
> dynamically-sized depending on the contiguous write size as
> queued by the time a ZFS transaction is closed and flushed to
> disk. In case of RAIDZ layouts, this logical block is further

Zfs data block sizes are fixed size! Only tail blocks are shorter. The
underlying representation (how the data block gets stored) depends on
whether compression, raidz, deduplication, etc., are used.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Very poor small-block random write performance
> This is normal. The problem is that with zfs 128k block sizes, zfs
> needs to re-read the original 128k block so that it can compose and
> write the new 128k block. With sufficient RAM, this is normally
> avoided because the original block is already cached in the ARC.
>
> If you were to reduce the zfs blocksize to 64k then the performance
> dive at 64k would go away but there would still be write performance
> loss at sizes other than a multiple of 64k.

I am not sure if I misunderstood the question or Bob's answer,
but I have a gut feeling it is not fully correct: ZFS block
sizes for files (filesystem datasets) are, at least by default,
dynamically-sized depending on the contiguous write size as
queued by the time a ZFS transaction is closed and flushed to
disk. In case of RAIDZ layouts, this logical block is further
striped over several sectors on several disks in one of the
top-level vdevs, starting with parity sectors for each "row".

So, if the test logically overwrites full blocks of test data
files, reads for recombination are not needed (but that can
be checked for with "iostat 1" or "zpool iostat" - to see how
many reads do happen during write-tests?) Note that some reads
will show up anyway, i.e. to update ZFS metadata (the block
pointer tree).

However, if the test file was written in 128K blocks and then
is rewritten with 64K blocks, then Bob's answer is probably
valid - the block would have to be re-read once for the first
rewrite of its half; it might be taken from cache for the
second half's rewrite (if that comes soon enough), and may be
spooled to disk as a couple of 64K blocks or one 128K block
(if both changes come soon after each other - within one TXG).

HTH,
//Jim Klimov
Re: [zfs-discuss] Very poor small-block random write performance
On Wed, 18 Jul 2012, Michael Traffanstead wrote:
> I have an 8 drive ZFS array (RAIDZ2 - 1 Spare) using 5900rpm 2TB SATA
> drives with an hpt27xx controller under FreeBSD 10 (but I've seen the
> same issue with FreeBSD 9). The system has 8gigs and I'm letting
> FreeBSD auto-size the ARC.
>
> Running iozone (from ports), everything is fine for file sizes up to
> 8GB, but when it runs with a 16GB file the random write performance
> plummets using 64K record sizes.

This is normal. The problem is that with zfs 128k block sizes, zfs
needs to re-read the original 128k block so that it can compose and
write the new 128k block. With sufficient RAM, this is normally avoided
because the original block is already cached in the ARC.

If you were to reduce the zfs blocksize to 64k then the performance
dive at 64k would go away but there would still be write performance
loss at sizes other than a multiple of 64k.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
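As a rough back-of-the-envelope check of that explanation against the
numbers in the original post (the ~100 IOPS figure is an assumed value
for a 5900 rpm SATA drive, not a measurement):

  a single raidz2 top-level vdev random-reads at roughly one disk's rate,
  say ~100 IOPS for 5900 rpm SATA;
  each uncached 64K random write first needs a 128K read, so:
      ~100 reads/s * 64 KiB of new data per read  ~=  6-7 MB/s

which is the order of the 7 MB/s reported for the 16G/64k case, while
the 8G file still largely fits in the ARC and avoids those reads.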
[zfs-discuss] Very poor small-block random write performance
I have an 8 drive ZFS array (RAIDZ2 - 1 Spare) using 5900rpm 2TB SATA
drives with an hpt27xx controller under FreeBSD 10 (but I've seen the
same issue with FreeBSD 9). The system has 8gigs and I'm letting
FreeBSD auto-size the ARC.

Running iozone (from ports), everything is fine for file sizes up to
8GB, but when it runs with a 16GB file the random write performance
plummets using 64K record sizes.

  8G - 64K   ->  52 MB/s
  8G - 128K  -> 713 MB/s
  8G - 256K  -> 442 MB/s

 16G - 64K   ->   7 MB/s
 16G - 128K  -> 380 MB/s
 16G - 256K  -> 392 MB/s

Also, sequential small block performance doesn't show such a dramatic
slowdown:

 16G - 64K   -> 108 MB/s (sequential)

There's nothing else using the zpool at the moment, the system is on a
separate ssd.

I was expecting performance to drop off at 16GB b/c that's well above
the available ARC, but seeing that dramatic of a drop off - and then
the sharp improvement at 128K and 256K - is surprising.

Are there any configuration settings I should be looking at?

Mike
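Since the 16GB working set versus the ARC size is central here, a quick
way to see what FreeBSD auto-sized the ARC to, as a sketch (sysctl names
as found in a stock FreeBSD ZFS setup):

  # sysctl vfs.zfs.arc_max                  # upper limit chosen at boot
  # sysctl kstat.zfs.misc.arcstats.size     # current ARC size in bytes

With 8 gigs of RAM the ARC cannot come close to holding 16GB, so the
larger file's blocks are no longer cached and partial-record rewrites
have to re-read from disk - which is what the replies above point to.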