On Sun, 17 Oct 2010, Edward Ned Harvey wrote:
The default blocksize is 128K. If you are using mirrors, then each
block on disk will be 128K whenever possible. But if you're using
raidzN with a capacity of M disks (M disks useful capacity + N disks
redundancy) then the block size on each individual disk will be 128K
/ M. Right? This is one of the reasons the raidzN resilver code is
inefficient: you end up waiting for the slowest seek time of
any one disk in the vdev, and when that's done, the amount of data
you were able to process was at most 128K. Rinse and repeat.
Your idea about what it means for "code" to be inefficient is clearly
vastly different than my own. Regardless, the physical layout
issues (impacting IOPS requirements) are a reality.
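To put rough numbers on the layout point, here is a minimal sketch in Python. It assumes the simplified model from the quoted text, where one block is split evenly across the M data disks, and it ignores parity, sector-size rounding, and compression. The function name is hypothetical and is not part of any zfs tool.

```python
KIB = 1024
DEFAULT_BLOCK = 128 * KIB  # default blocksize mentioned above


def per_disk_chunk_bytes(block_bytes: int, data_disks: int) -> float:
    """Approximate bytes each data disk holds for one block.

    Simplified model (an assumption): the block is split evenly across
    the data disks; parity and sector rounding are ignored.
    """
    if data_disks < 1:
        raise ValueError("data_disks must be >= 1")
    return block_bytes / data_disks


if __name__ == "__main__":
    # M=1 stands in for the mirror case, where each disk holds the whole block.
    for m in (1, 4, 8):
        chunk = per_disk_chunk_bytes(DEFAULT_BLOCK, m)
        print(f"M={m}: about {chunk / KIB:.0f}K per disk per block")
```

Under the same model, a blocksize of 128K * M brings the per-disk chunk back to 128K, which is the effect the proposal is after.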
Would it not be wise, when creating raidzN vdevs, to increase the
blocksize to 128K * M? Then, the on-disk blocksize for each disk
could be the same as the mirror on-disk blocksize of 128K. It still
won't resilver as fast as a mirror, but the raidzN resilver would be
accelerated by as much as M times. Right?
This might work for HPC applications with huge files and huge
sequential streaming data rate requirements. It would be detrimental
for the case of small files, or applications which issue many small
writes, and particularly bad for many random synchronous writes.
The only disadvantage that I know of would be wasted space. Every
4K file in a mirror can waste up to 124K of disk space, right? And
in the above described scenario, every 4K file in the raidzN can
waste up to 128K * M of disk space, right? Also, if you have a lot
of these sparse 4K blocks, then the resilver time doesn't actually
improve either, because you perform one seek and, whether you fetch
128K or 128K*M, you still pay one maximum seek time to fetch 4K of
useful data.
The tally of disadvantages is quite large. Note that zfs needs to
write each zfs "block" and you are dramatically increasing the level
of write amplification. Also zfs needs to checksum each whole block
and the checksum adds to the latency. The risk of block corruption is
increased. 128K is already quite large for a block.
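The write amplification point can be illustrated with a rough Python sketch. It assumes a small overwrite inside a file that already spans full-size blocks, so that the whole block has to be rewritten, and it ignores parity, metadata, and compression. The function name is hypothetical.

```python
KIB = 1024


def write_amplification(block_bytes: int, write_bytes: int) -> float:
    """Bytes rewritten per byte the application actually changed.

    Assumption: each small write causes one whole block to be rewritten.
    """
    if write_bytes < 1 or write_bytes > block_bytes:
        raise ValueError("write_bytes must be between 1 and block_bytes")
    return block_bytes / write_bytes


if __name__ == "__main__":
    small_write = 4 * KIB
    # Compare the default 128K block with the proposed 128K * M for M = 8.
    for label, block in (("128K", 128 * KIB), ("128K * 8", 128 * KIB * 8)):
        factor = write_amplification(block, small_write)
        print(f"block {label}: {factor:.0f}x amplification for a 4K write")
```

In this model the amplification for a fixed small write grows linearly with M, and the amount of data that must be checksummed per block grows by the same factor.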
Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/