On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:
> On Tue, 27 Sep 2011, Matt Banks wrote:
>> Also, maybe I read it wrong, but why is it that (in the previous
>> thread about hw raid and zpools) zpools with large numbers of
>> physical drives (eg 20+) were frowned upon? I know that ZFS!=WAFL
>
> There is no concern with a large number of physical drives in a pool.
> The primary concern is with the number of drives per vdev. Any
> variation in the latency of the drives hinders performance and each
> I/O to a vdev consumes 1 "IOP" across all of the drives in the vdev
> (or strip) when raidzN is used. Having more vdevs is better for
> consistent performance and more available IOPS.

To expound just a bit on Bob's reply: the reason that large numbers of
disks in a RAIDZ* vdev are frowned upon is that IOPS for a RAIDZ vdev
are pretty much O(1), i.e. constant, regardless of how many disks are
in the vdev. So a 20-disk vdev delivers roughly the same IOPS as a
5-disk vdev. Streaming throughput is significantly higher (i.e. it
scales as O(N)), but you're unlikely to see that for the vast majority
of workloads.
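
As a back-of-the-envelope sketch of that scaling (not a benchmark): the
per-disk figures below are assumed round numbers, and the two helper
functions are hypothetical names introduced only for illustration.

```python
# Rough model only: the per-disk figures are assumptions, not measurements.
PER_DISK_IOPS = 100   # hypothetical random IOPS of a single drive
PER_DISK_MBPS = 100   # hypothetical streaming MB/s of a single drive


def pool_random_iops(vdevs: int) -> int:
    """Each RAIDZ vdev delivers about one disk's worth of random IOPS,
    so pool IOPS scale with the number of vdevs, not the number of disks."""
    return vdevs * PER_DISK_IOPS


def pool_streaming_mbps(vdevs: int, disks_per_vdev: int, parity: int = 1) -> int:
    """Streaming throughput scales with the number of data (non-parity) disks."""
    return vdevs * (disks_per_vdev - parity) * PER_DISK_MBPS


# Same 20 drives, two layouts.
for vdevs, disks in [(1, 20), (4, 5)]:
    print(f"{vdevs} x {disks}-disk RAIDZ: "
          f"~{pool_random_iops(vdevs)} random IOPS, "
          f"~{pool_streaming_mbps(vdevs, disks)} MB/s streaming")
```

Under these assumptions the 4 x 5-disk layout gets about four times the
random IOPS of the single 20-disk vdev, while giving up only a little
streaming throughput.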
Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the
situation where the time to resilver X amount of data on a 5-drive RAIDZ
is the same as on a 30-drive RAIDZ. Since you're highly likely to
store much more data on a larger vdev, your resilver time to replace a
drive goes up linearly with the number of drives in a RAIDZ vdev.
This leads to the following situation: if I have 20 x 1TB drives, here
are several possible configurations and their relative resilver times
(relative, because without knowing the exact layout of the data itself,
I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ:  15TB usable, takes N time to replace a disk
(b) 4 x 5-disk RAIDZ:  16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the
resilver time for the same amount of data in the ZPOOL.
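
The arithmetic behind those four configurations can be reproduced with a
short sketch. It assumes single-parity RAIDZ, 1TB drives, and a resilver
time simply proportional to the number of disks in the vdev (normalized
to the 4-disk case); the function names are made up for illustration.

```python
DRIVE_TB = 1          # size of each drive, in TB
BASELINE_DISKS = 4    # the 4-disk vdev is taken as resilver time "N"


def usable_tb(vdevs: int, disks_per_vdev: int, parity: int = 1) -> int:
    """Usable capacity: every vdev gives up `parity` disks' worth of space."""
    return vdevs * (disks_per_vdev - parity) * DRIVE_TB


def relative_resilver(disks_per_vdev: int) -> float:
    """Resilver time relative to the baseline, assumed linear in vdev width."""
    return disks_per_vdev / BASELINE_DISKS


# (number of vdevs, disks per vdev) for configurations (a) through (d)
layouts = [(5, 4), (4, 5), (2, 10), (1, 20)]

for vdevs, disks in layouts:
    print(f"{vdevs} x {disks}-disk RAIDZ: "
          f"{usable_tb(vdevs, disks)}TB usable, "
          f"{relative_resilver(disks)}N to replace a disk")
```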
The above also applies to RAIDZ2 and RAIDZ3, as the additional parity
disk doesn't materially impact resilver times in either direction (and,
yes, it's not really a "parity disk", I'm just being sloppy).
The other main reason is that a larger number of drives in a single
vdev means a higher probability that multiple disk failures will
result in loss of data. Richard Elling had some data on the exact
calculations, but it boils down to the fact that your chance of total
data loss from multiple drive failures goes up MORE THAN LINEARLY as
you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ zpool has well over
2x the chance of failure that a 2 x 5-disk RAIDZ zpool has.
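
To see where "well over 2x" can come from, here is a deliberately
simple model. It is not Richard Elling's calculation: it assumes every
disk fails independently with the same probability p during some window
(for example, one resilver), and that a vdev is lost when more disks
fail than it has parity. The value of p and the function names are
illustrative assumptions.

```python
from math import comb


def vdev_loss_probability(disks: int, p: float, parity: int = 1) -> float:
    """P(more than `parity` of `disks` drives fail), binomial model."""
    survivable = sum(
        comb(disks, k) * p**k * (1 - p) ** (disks - k)
        for k in range(parity + 1)
    )
    return 1.0 - survivable


def pool_loss_probability(vdevs: int, disks_per_vdev: int, p: float,
                          parity: int = 1) -> float:
    """The pool is lost if any one of its vdevs is lost."""
    vdev_ok = 1.0 - vdev_loss_probability(disks_per_vdev, p, parity)
    return 1.0 - vdev_ok**vdevs


p = 0.01  # assumed per-disk failure probability within the window
wide = pool_loss_probability(1, 10, p)    # 1 x 10-disk RAIDZ
narrow = pool_loss_probability(2, 5, p)   # 2 x 5-disk RAIDZ
print(f"1 x 10-disk is {wide / narrow:.2f}x as likely to lose data")
```

For small p the vdev loss probability behaves like C(n,2) * p^2, which
grows with the square of the vdev width; the ratio between the two
layouts then approaches 45/20 = 2.25.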
zfs-discuss mailing list