Re: [zfs-discuss] Very poor small-block random write performance

Traffanstead, Mike Thu, 19 Jul 2012 18:26:14 -0700

iozone doesn't vary the block size during a test. It's a very
artificial benchmark, but it's useful for gauging performance under
different scenarios.


So within each step of this test, all of the writes would have been
the same size: 64k for one step, 128k for the next, and so on.

Just as another point of reference, I reran the test with a Crucial M4
SSD and the result for 16G/64k was 35 MB/s (a 5x improvement).

I'll rerun that part of the test with zpool iostat and see what it says.
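
For anyone following along, here is a rough sketch of the
read-modify-write arithmetic described in the quoted text below. It is
only a model under stated assumptions: the file was previously written
in fixed-size records, nothing is cached in the ARC, and the helper
name partial_records is made up for illustration (it is not a ZFS or
iozone API). It lists which existing records a write covers only
partially, i.e. the ones that would have to be read back before the new
record can be composed.

```python
K = 1024

def partial_records(offset, length, recordsize):
    """Return indices of existing records that a write only partly covers.

    Hypothetical helper: assumes the file is already laid out in
    fixed-size records of `recordsize` bytes and ignores caching.
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // recordsize
    last = (end - 1) // recordsize
    partial = []
    for idx in range(first, last + 1):
        rec_start = idx * recordsize
        rec_end = rec_start + recordsize
        # Bytes of this record that the write actually overwrites.
        covered = min(end, rec_end) - max(offset, rec_start)
        if covered < recordsize:
            partial.append(idx)
    return partial

# 64k write into a 128k record: record 0 must be re-read first.
print(partial_records(0, 64 * K, 128 * K))
# Aligned 128k write over a 128k record: nothing to re-read.
print(partial_records(0, 128 * K, 128 * K))
```

Under this model an aligned full-record overwrite needs no reads for
recombination, while a 64k write into a 128k record does, which is the
case the discussion below turns on.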

Mike

On Thu, Jul 19, 2012 at 7:27 PM, Jim Klimov <jimkli...@cos.ru> wrote:
>> This is normal.  The problem is that with zfs 128k block sizes, zfs
>> needs to re-read the original 128k block so that it can compose and
>> write the new 128k block.  With sufficient RAM, this is normally avoided
>> because the original block is already cached in the ARC.
>>
>> If you were to reduce the zfs blocksize to 64k then the performance dive
>> at 64k would go away but there would still be write performance loss at
>> sizes other than a multiple of 64k.
>
>
> I am not sure if I misunderstood the question or Bob's answer,
> but I have a gut feeling it is not fully correct: ZFS block
> sizes for files (filesystem datasets) are, at least by default,
> dynamically-sized depending on the contiguous write size as
> queued by the time a ZFS transaction is closed and flushed to
> disk. In the case of RAIDZ layouts, this logical block is further
> striped over several sectors on several disks in one of the
> top-level vdevs, starting with parity sectors for each "row".
>
> So, if the test logically overwrites full blocks of the test data
> files, reads for recombination are not needed. (That can be
> checked with "iostat 1" or "zpool iostat" to see how many reads
> actually happen during the write tests.) Note that some reads
> will show up anyway, e.g. to update ZFS metadata (the block
> pointer tree).
>
> However, if the test file was written in 128K blocks and then
> is rewritten with 64K blocks, then Bob's answer is probably
> valid - the block would have to be re-read once for the first
> rewrite of its half; it might be taken from cache for the
> second half's rewrite (if that comes soon enough), and may be
> spooled to disk as a couple of 64K blocks or one 128K block
> (if both changes come soon after each other - within one TXG).
>
> HTH,
> //Jim Klimov
>
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
