On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:

> On Tue, 27 Sep 2011, Matt Banks wrote:
>
>> Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (eg 20+) were frowned upon? I know that ZFS!=WAFL
>
> There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes 1 "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS.
>
> Bob

To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon has to do with the fact that IOPS for a RAIDZ vdev are pretty much O(C), regardless of how many disks are in the actual vdev. So, the IOPS throughput of a 20-disk vdev is the same as a 5-disk vdev. Streaming throughput is significantly higher (i.e. it scales as O(N)), but you're unlikely to get that for the vast majority of workloads.
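To put toy numbers on that scaling (my own illustration with assumed per-disk figures, not measured ZFS behavior):

```python
# Toy model of the scaling above (assumed per-disk numbers, purely
# illustrative): a raidzN vdev delivers roughly one disk's worth of
# random IOPS, while streaming bandwidth scales with the data disks.

DISK_IOPS = 150   # assumed random IOPS for one 7200rpm drive
DISK_MBPS = 100   # assumed streaming MB/s for one drive

def vdev_perf(disks, parity=1):
    """Rough random IOPS and streaming MB/s for one raidzN vdev."""
    return DISK_IOPS, DISK_MBPS * (disks - parity)

def pool_perf(vdevs, disks_per_vdev, parity=1):
    """Pool totals scale with the number of vdevs, not total disks."""
    iops, mbps = vdev_perf(disks_per_vdev, parity)
    return vdevs * iops, vdevs * mbps

# The same 20 disks arranged two ways:
print(pool_perf(4, 5))    # 4 x 5-disk raidz  -> (600, 1600)
print(pool_perf(1, 20))   # 1 x 20-disk raidz -> (150, 1900)
```

The single wide vdev wins slightly on streaming, but delivers a quarter of the random IOPS.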

Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as on a 30-drive RAIDZ. Since you're highly likely to store much more data on a larger vdev, your resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev.

This leads to the following situation: if I have 20 x 1TB drives, here are several possible configurations and their relative resilver times (relative, because without knowing the exact configuration of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ: 15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ: 16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk
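The arithmetic behind those four layouts can be sketched like so (a quick script of mine, assuming raidz1 with evenly filled vdevs, so resilver time is proportional to the data in the one vdev being rebuilt, i.e. inversely proportional to the number of vdevs):

```python
# Relative resilver times for 20 x 1TB drives. Assumptions (mine, for
# illustration): raidz1, a fixed amount of pool data spread evenly across
# vdevs, and IOPS-bound resilver, so rebuild time is proportional to the
# data per vdev -- which scales as 1 / (number of vdevs).

layouts = [(5, 4), (4, 5), (2, 10), (1, 20)]   # (vdevs, disks per vdev)

for vdevs, disks in layouts:
    usable_tb = vdevs * (disks - 1)            # raidz1: one disk's worth of parity per vdev
    resilver = layouts[0][0] / vdevs           # relative to the 5-vdev baseline
    print(f"{vdevs} x {disks}-disk RAIDZ: {usable_tb}TB usable, {resilver:g}N resilver")
```

Running it reproduces the table: 15/16/18/19TB usable and N/1.25N/2.5N/5N resilver times.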

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the zpool.

The above also applies to RAIDZ[23], as the additional parity disk doesn't materially impact resilver times in either direction (and, yes, it's not really a "parity disk", I'm just being sloppy).

The other main reason is that larger numbers of drives in a single vdev mean there is a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to the fact that your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ zpool has.
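A back-of-envelope check of the "more than linearly" claim (my own toy model, not Richard's calculations: independent disk failures with probability p each over some window, and a raidz1 vdev losing data when 2+ of its disks fail in that window):

```python
# Toy failure model (illustrative assumptions: independent failures,
# probability p per disk over the resilver window; raidz1 survives one
# failure per vdev but not two).

def p_vdev_loss(n, p):
    """P(at least 2 of n disks fail) = 1 - P(0 fail) - P(exactly 1 fails)."""
    return 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

def p_pool_loss(vdevs, disks, p):
    """The pool dies if any one vdev dies."""
    return 1 - (1 - p_vdev_loss(disks, p)) ** vdevs

p = 0.01
one_big   = p_pool_loss(1, 10, p)   # 1 x 10-disk raidz
two_small = p_pool_loss(2, 5, p)    # 2 x 5-disk raidz
print(one_big / two_small)          # roughly 2.2: over double the risk, same 10 disks
```

For small p the ratio approaches C(10,2)/(2 * C(5,2)) = 45/20 = 2.25, which is where the "well over 2x" comes from.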

-Erik
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss