On Sun, 17 Oct 2010, Edward Ned Harvey wrote:


The default blocksize is 128K.  If you are using mirrors, then each block on disk will be 128K whenever possible.  But if you're using raidzN with a capacity of M disks (M disks useful capacity + N disks redundancy) then the block size on each individual disk will be 128K / M.  Right?  This is one of the reasons the raidzN resilver code is inefficient: you end up waiting for the slowest seek time of any one disk in the vdev, and when that completes, the amount of data you were able to process is at most 128K.  Rinse and repeat.
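For concreteness, the arithmetic above can be sketched as follows (the M=6 case, e.g. an 8-disk raidz2, is a hypothetical example, not taken from the post):

```python
RECORDSIZE = 128 * 1024          # default zfs recordsize, in bytes

def per_disk_segment(recordsize, m_data_disks):
    """Bytes each data disk holds for one zfs block (parity disks excluded)."""
    return recordsize // m_data_disks

# Mirror: every disk holds the whole 128K block.
print(per_disk_segment(RECORDSIZE, 1))   # 131072

# raidz2 across 8 disks (M=6 data + 2 parity): each data disk
# holds only ~21K of the 128K block, so one worst-case seek
# per disk yields very little data.
print(per_disk_segment(RECORDSIZE, 6))   # 21845
```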

Your idea of what it means for "code" to be inefficient is clearly very different from my own. Regardless, the physical layout issues (impacting IOPS requirements) are a reality.

Would it not be wise, when creating raidzN vdevs, to increase the blocksize to 128K * M?  Then the on-disk blocksize for each disk could be the same as the mirror on-disk blocksize of 128K.  It still won't resilver as fast as a mirror, but the raidzN resilver would be accelerated by as much as M times.  Right?
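The claimed speedup can be sketched like this (a back-of-the-envelope model under the post's own assumption that each seek services one whole zfs block; M=6 is again a hypothetical example):

```python
def payload_per_seek(recordsize):
    """Useful bytes the vdev processes per worst-case stripe seek,
    assuming one seek per zfs block across all member disks."""
    return recordsize

M = 6  # hypothetical number of data disks

default_rate  = payload_per_seek(128 * 1024)       # 128K per seek
proposed_rate = payload_per_seek(128 * 1024 * M)   # 128K on *each* data disk

# Under this simplified model, resilver throughput improves by a factor of M.
print(proposed_rate // default_rate)   # 6
```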

This might work for HPC applications with huge files and huge sequential streaming data rate requirements. It would be detrimental for the case of small files, or applications which issue many small writes, and particularly bad for many random synchronous writes.

The only disadvantage that I know of would be wasted space.  Every 4K file in a mirror can waste up to 124K of disk space, right?  And in the above described scenario, every 4K file in the raidzN can waste up to 128K * M of disk space, right?  Also, if you have a lot of these sparse 4K blocks, then the resilver time doesn't actually improve either, because you perform one seek and, regardless of whether you fetch 128K or 128K*M, you still paid one maximum seek time to fetch 4K of useful data.
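Taking the post's worst-case assumption at face value (a tiny file still occupying one full block), the waste works out as below; M=6 is a hypothetical example, and in practice zfs uses smaller blocks for small files, so this is an upper bound:

```python
def worst_case_waste(blocksize, filesize=4096):
    """Bytes lost if a 4K file nevertheless occupies one full block."""
    return blocksize - filesize

# Mirror with the default 128K block: ~124K wasted per 4K file.
print(worst_case_waste(128 * 1024))        # 126976

# Proposed recordsize of 128K * M with M=6: ~764K wasted per 4K file.
print(worst_case_waste(128 * 1024 * 6))    # 782336
```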

The tally of disadvantages is quite long. Note that zfs must write each zfs "block" as a unit, so you would dramatically increase the level of write amplification. Zfs also checksums each whole block, and that checksum adds to the latency. The risk of block corruption increases as well. 128K is already quite large for a block.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
