On 9/27/2011 10:39 AM, Bob Friesenhahn wrote:
On Tue, 27 Sep 2011, Matt Banks wrote:

Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (e.g. 20+) were frowned upon? I know that ZFS != WAFL.

There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes one "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS.

Bob

To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon is that the random IOPS of a RAIDZ vdev are essentially constant (O(1)), regardless of how many disks are in the vdev. So, the IOPS throughput of a 20-disk vdev is the same as that of a 5-disk vdev. Streaming throughput is significantly higher (it scales as O(N) with the number of disks), but the vast majority of workloads won't see that.
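
To put rough numbers on that, here's a hypothetical Python sketch. It assumes ~100 random IOPS per spinning disk as a round figure; the exact number doesn't matter, only the scaling does:

  # Hypothetical sketch: random IOPS scale with the number of vdevs,
  # not the total number of disks, because a raidzN vdev delivers
  # roughly one disk's worth of random IOPS.
  DISK_IOPS = 100  # assumed random IOPS for a single spinning disk

  def pool_random_iops(vdevs, disks_per_vdev):
      # disks_per_vdev intentionally doesn't affect the result:
      # for random I/O, a raidz vdev behaves like a single disk.
      return vdevs * DISK_IOPS

  for vdevs, width in [(5, 4), (4, 5), (2, 10), (1, 20)]:
      print(f"{vdevs} x {width}-disk raidz: "
            f"~{pool_random_iops(vdevs, width)} random IOPS")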

Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as a 30-drive RAIDZ. Given that you're highly likely to store much more data on a larger vdev, your resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev.

This leads to the following situation: if I have 20 x 1TB drives, here are several possible configurations and their relative resilver times (relative, because without knowing the exact layout of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ:  15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ:  16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the ZPOOL.
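
If it helps, here's the arithmetic behind that list as a small sketch (hypothetical Python; it assumes raidz1, an evenly filled pool, and resilver time proportional to the data stored per vdev):

  # Hypothetical sketch of the arithmetic above for 20 x 1TB drives.
  # Assumptions: raidz1 (one disk's worth of parity per vdev), pool
  # evenly filled, resilver time proportional to the data per vdev.
  DISK_TB = 1

  for vdevs, width in [(5, 4), (4, 5), (2, 10), (1, 20)]:
      usable_tb = vdevs * (width - 1) * DISK_TB
      # The same total data spread over fewer vdevs means more data
      # per vdev, hence a proportionally longer resilver.  Normalized
      # to the 4-disk case (N).
      relative_resilver = width / 4
      print(f"{vdevs} x {width}-disk raidz1: {usable_tb}TB usable, "
            f"~{relative_resilver:.2f}N resilver time")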

The above also applies to RAIDZ[23], as the additional parity disk doesn't materially impact resilver times in either direction (and, yes, it's not really a "parity disk", I'm just being sloppy).

The other main reason is that a larger number of drives in a single vdev means a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to this: your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ pool has.
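
A rough way to see the more-than-linear part (hypothetical Python; it just assumes each disk independently fails with some probability p during a window, say the resilver window, and that a raidz1 vdev is lost once any two of its disks fail):

  # Hypothetical sketch: data-loss risk grows more than linearly with
  # vdev width.  Assumption: each disk fails independently with
  # probability p during the window; raidz1 loses data when two or
  # more disks in the SAME vdev fail.
  p = 0.01  # assumed per-disk failure probability in the window

  def loss_probability(vdevs, width):
      # P(a given vdev sees 2+ failures) = 1 - P(0 failures) - P(exactly 1)
      vdev_loss = 1 - (1 - p) ** width - width * p * (1 - p) ** (width - 1)
      # P(at least one vdev is lost)
      return 1 - (1 - vdev_loss) ** vdevs

  for vdevs, width in [(2, 5), (1, 10)]:
      print(f"{vdevs} x {width}-disk raidz1: "
            f"P(data loss) ~ {loss_probability(vdevs, width):.5f}")

With those assumptions the single 10-disk vdev comes out a bit more than twice as likely to lose data as the 2 x 5-disk layout, which lines up with the point above.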

-Erik
