On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:
On Tue, 27 Sep 2011, Matt Banks wrote:

Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL.

There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes one "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS.

Bob

To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon is that the random IOPS of a RAIDZ vdev are essentially constant (O(1)), regardless of how many disks are in the vdev. So, the IOPS throughput of a 20-disk vdev is the same as that of a 5-disk vdev. Streaming throughput is significantly higher (it scales as O(N) with the number of disks), but the vast majority of workloads won't see that.
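
To put rough numbers on that, here's a hypothetical Python sketch. It assumes ~100 random IOPS per spinning disk as a round figure; the exact number doesn't matter, only the scaling does:

  # Hypothetical sketch: random IOPS scale with the number of vdevs,
  # not the total number of disks, because a raidzN vdev delivers
  # roughly one disk's worth of random IOPS.
  DISK_IOPS = 100  # assumed random IOPS for a single spinning disk

  def pool_random_iops(vdevs, disks_per_vdev):
      # disks_per_vdev intentionally doesn't affect the result:
      # for random I/O, a raidz vdev behaves like a single disk.
      return vdevs * DISK_IOPS

  for vdevs, width in [(5, 4), (4, 5), (2, 10), (1, 20)]:
      print(f"{vdevs} x {width}-disk raidz: "
            f"~{pool_random_iops(vdevs, width)} random IOPS")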

Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as a 30-drive RAIDZ. Given that you're highly likely to store much more data on a larger vdev, your resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev.

This leads to the following situation: if I have 20 x 1TB drives, here are several possible configurations and their relative resilver times (relative, because without knowing the exact layout of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ:  15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ:  16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the ZPOOL.
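
If it helps, here's the arithmetic behind that list as a small sketch (hypothetical Python; it assumes raidz1, an evenly filled pool, and resilver time proportional to the data stored per vdev):

  # Hypothetical sketch of the arithmetic above for 20 x 1TB drives.
  # Assumptions: raidz1 (one disk's worth of parity per vdev), pool
  # evenly filled, resilver time proportional to the data per vdev.
  DISK_TB = 1

  for vdevs, width in [(5, 4), (4, 5), (2, 10), (1, 20)]:
      usable_tb = vdevs * (width - 1) * DISK_TB
      # The same total data spread over fewer vdevs means more data
      # per vdev, hence a proportionally longer resilver.  Normalized
      # to the 4-disk case (N).
      relative_resilver = width / 4
      print(f"{vdevs} x {width}-disk raidz1: {usable_tb}TB usable, "
            f"~{relative_resilver:.2f}N resilver time")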

The above also applies to RAIDZ[23], as the additional parity disk doesn't materially impact resilver times in either direction (and, yes, it's not really a "parity disk", I'm just being sloppy).

The other main reason is that a larger number of drives in a single vdev means a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to this: your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ pool has.
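
A rough way to see the more-than-linear part (hypothetical Python; it just assumes each disk independently fails with some probability p during a window, say the resilver window, and that a raidz1 vdev is lost once any two of its disks fail):

  # Hypothetical sketch: data-loss risk grows more than linearly with
  # vdev width.  Assumption: each disk fails independently with
  # probability p during the window; raidz1 loses data when two or
  # more disks in the SAME vdev fail.
  p = 0.01  # assumed per-disk failure probability in the window

  def loss_probability(vdevs, width):
      # P(a given vdev sees 2+ failures) = 1 - P(0 failures) - P(exactly 1)
      vdev_loss = 1 - (1 - p) ** width - width * p * (1 - p) ** (width - 1)
      # P(at least one vdev is lost)
      return 1 - (1 - vdev_loss) ** vdevs

  for vdevs, width in [(2, 5), (1, 10)]:
      print(f"{vdevs} x {width}-disk raidz1: "
            f"P(data loss) ~ {loss_probability(vdevs, width):.5f}")

With those assumptions the single 10-disk vdev comes out a bit more than twice as likely to lose data as the 2 x 5-disk layout, which lines up with the point above.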

-Erik
