On Oct 17, 2010, at 6:38 AM, Edward Ned Harvey wrote:

> The default blocksize is 128K. If you are using mirrors, then each block on
> disk will be 128K whenever possible. But if you're using raidzN with a
> capacity of M disks (M disks useful capacity + N disks redundancy) then the
> block size on each individual disk will be 128K / M. Right?

Yes, but it is worse for RAID-5, where you will likely have to do a RMW if
your stripe size is not perfectly matched to the blocksize. This is the case
where raidz shines over the alternatives.
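For illustration, a rough back-of-the-envelope sketch in Python of the
arithmetic above. It is not ZFS's actual allocator; real raidz allocation
also rounds each disk's share up to sector boundaries and adds parity and
padding sectors, so treat the numbers as approximations only.

    def per_disk_io_kib(recordsize_kib, data_disks):
        """Approximate share of one logical record written to each data disk."""
        return recordsize_kib / data_disks

    # 128K record on a mirror: each side of the mirror gets the whole 128K.
    print(per_disk_io_kib(128, 1))       # 128.0

    # 128K record on an 8+2 raidz2 (M = 8 data disks): roughly 16K per disk.
    print(per_disk_io_kib(128, 8))       # 16.0

    # A recordsize of 128K * M would restore ~128K of work per disk per record.
    print(per_disk_io_kib(128 * 8, 8))   # 128.0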
> This is one of the reasons the raidzN resilver code is inefficient. Since
> you end up waiting for the slowest seek time of any one disk in the vdev, and
> when that's done, the amount of data you were able to process was at most
> 128K. Rinse and repeat.

How is this different than all other RAID implementations?

> Would it not be wise, when creating raidzN vdev's, to increase the blocksize
> to 128K * M? Then, the on-disk blocksize for each disk could be the same as
> the mirror on-disk blocksize of 128K. It still won't resilver as fast as a
> mirror, but the raidzN resilver would be accelerated by as much as M times.
> Right?

We had this discussion in 2007, IIRC. The bottom line was that if you have a
fixed record size workload, then set the appropriate recordsize and it will
make sense to adjust your raidz1 configuration to avoid gaps. For raidz2/3 or
mixed record length workloads, it is not clear that matching the number of
data/parity disks offers any advantage.

> The only disadvantage that I know of would be wasted space. Every 4K file in
> a mirror can waste up to 124K of disk space, right?

No. 4K files have a recordsize of 4K. This is why we refer to this case as a
mixed record size workload. Remember, the recordsize parameter is a maximum
limit, not a minimum limit.

> And in the above described scenario, every 4K file in the raidzN can waste up
> to 128K * M of disk space, right?

No.

> Also, if you have a lot of these sparse 4K blocks, then the resilver time
> doesn't actually improve either. Because you perform one seek, and
> regardless if you fetch 128K or 128K*M, you still paid one maximum seek time
> to fetch 4K of useful data.

Seek penalties are hard to predict or model. Modern drives have efficient
algorithms and large buffer caches. It cannot be predicted whether the next
read will already be in the buffer cache. Indeed, it is not even possible to
predict the read order. The only sure-fire way to prevent seeks is to use
SSDs.

> Point is: If the goal is to reduce the number of on-disk slabs, and
> therefore reduce the number of seeks necessary to resilver, one thing you
> could do is increase the pool blocksize, right?

Not the pool block size, the application's block size. Applications which
make lots of itty bitty I/Os will tend to take more time to resilver.
Applications that make lots of large I/Os will resilver faster.

> YMMV, and YM will depend on how you use your pool. Hopefully you're able to
> bias your usage in favor of large block writes.

Yep, it depends entirely on how you use the pool. As soon as you come up with
a credible model to predict that, then we can optimize accordingly :-)
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com