This is normal.  With a 128k zfs block size, a write smaller than 128k
forces zfs to re-read the original 128k block so that it can compose
and write the new 128k block.  With sufficient RAM, this read is
normally avoided because the original block is already cached in the ARC.

If you were to reduce the zfs blocksize to 64k then the performance dive
at 64k would go away but there would still be write performance loss at
sizes other than a multiple of 64k.
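As a rough illustration of the argument above, here is a minimal Python
sketch that lists which blocks a single write covers only partially, and
would therefore have to be re-read first. It assumes a fixed block size
and ignores the ARC, compression and dynamic block sizing; the function
name records_needing_read is hypothetical and is not any ZFS interface.

```python
def records_needing_read(offset, length, recordsize=128 * 1024):
    """Return indices of blocks that the write covers only partially.

    Simplified model: a block that is overwritten in full needs no
    read, while a partially covered block must be read to be recomposed.
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // recordsize
    last = (end - 1) // recordsize
    partial = []
    for idx in range(first, last + 1):
        rec_start = idx * recordsize
        rec_end = rec_start + recordsize
        # The write misses the head or the tail of this block.
        if offset > rec_start or end < rec_end:
            partial.append(idx)
    return partial
```

In this model an aligned 128k write to a 128k-block file touches no
partial block, a 64k write always does, and a 192k write leaves the
second block half covered.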

I am not sure whether I misunderstood the question or Bob's answer,
but I have a gut feeling it is not fully correct: ZFS block
sizes for files (filesystem datasets) are, at least by default,
dynamically sized depending on the contiguous write size
queued by the time a ZFS transaction group (TXG) is closed and
flushed to disk. In RAIDZ layouts, this logical block is further
striped over several sectors on several disks in one of the
top-level vdevs, starting with parity sectors for each "row".

So, if the test logically overwrites full blocks of the test data
files, reads for recombination are not needed. This can be
checked with "iostat 1" or "zpool iostat", to see how many reads
actually happen during the write tests. Note that some reads
will show up anyway, e.g. to update ZFS metadata (the block
pointer tree).

However, if the test file was written in 128K blocks and is then
rewritten in 64K blocks, Bob's answer is probably valid: the
block would have to be re-read once for the rewrite of its first
half; it might be taken from cache for the rewrite of its second
half (if that comes soon enough), and may be spooled to disk as
a couple of 64K blocks or as one 128K block (the latter if both
changes come soon after each other, within one TXG).
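The rewrite scenario above can be sketched as a toy model in Python. It
assumes the file already exists with a fixed 128K block size, and that the
cache never evicts anything (which a real ARC does not guarantee); it
counts only data-block re-reads, not metadata reads. The name
count_rereads is hypothetical, purely for illustration.

```python
def count_rereads(writes, recordsize=128 * 1024, cache_enabled=True):
    """Count blocks re-read from disk for a sequence of (offset, length) writes.

    Toy model: a partially overwritten block must be read unless it is
    already cached; every block touched by a write ends up in the cache.
    """
    cached = set()
    reads = 0
    for offset, length in writes:
        if length <= 0:
            continue
        end = offset + length
        first = offset // recordsize
        last = (end - 1) // recordsize
        for idx in range(first, last + 1):
            rec_start = idx * recordsize
            full = offset <= rec_start and end >= rec_start + recordsize
            if not full and idx not in cached:
                reads += 1  # the old block has to come from disk
            if cache_enabled:
                cached.add(idx)
    return reads


K = 1024
# Rewrite a 256K file (two 128K blocks) sequentially in 64K chunks.
half_block_writes = [(i * 64 * K, 64 * K) for i in range(4)]
print(count_rereads(half_block_writes))                       # 2
print(count_rereads(half_block_writes, cache_enabled=False))  # 4
```

With the cache, only the first 64K rewrite of each 128K block costs a
read; without it, every 64K rewrite does. Rewriting the same file in
128K chunks costs no re-reads in this model.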

//Jim Klimov

zfs-discuss mailing list