> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ray Van Dolson
>
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs. 16 vdevs of 15 disks each -- RAIDZ3. NexentaStor 3.1.2.

I think you'll get better performance and reliability if you break each of those 15-disk raidz3's into three 5-disk raidz1's. Here's why.

With raidz3, if any 3 of 15 disks fail, you're still in operation, and on the 4th failure you're toast. With raidz1, if any 1 of 5 disks fails, you're still in operation, and on the 2nd failure you're toast. So it comes down to the probability of 4 overlapping failures in the 15-disk raidz3 versus 2 overlapping failures in a smaller 5-disk raidz1. To calculate that, you need to estimate the time to resilver any one failed disk.

In ZFS, suppose you have a 128k record, and suppose you have a 2-way mirror vdev. Then each disk writes 128k. If you have a 3-disk raidz1, each disk writes 64k. If you have a 5-disk raidz1, each disk writes 32k. If you have a 15-disk raidz3, each disk writes 10.6k.

Now assume the machine is in production and you are doing autosnapshots, and your data is volatile. Over time that fragments your data, and after a year or two in production your resilver will be composed almost entirely of random IO. Each of the non-failed disks must read its segment of the stripe in order to reconstruct the data that will be written to the new good disk. In the 15-disk raidz3 configuration, your segment size is approximately 3x smaller than in the 5-disk raidz1, which means approximately 3x more IO operations.

Another way of saying that: assume the amount of data you write to your pool is the same regardless of which architecture you choose. For discussion purposes, say you write 3T to your pool, and momentarily assume your whole pool is composed of 15 disks, either a single raidz3 or 3x 5-disk raidz1. If you use one big raidz3, the 3T will require at least 24 million 128k records to hold it all, and each 128k record will be divided up onto all the disks.
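The per-disk segment sizes and record counts above are simple arithmetic; here is a back-of-envelope sketch (the helper names are mine, not from any ZFS tool):

```python
# Back-of-envelope sketch of the per-disk write sizes and record counts
# discussed above. Purely illustrative; helper names are mine.

RECORD_KIB = 128  # ZFS recordsize assumed throughout the discussion

def segment_kib(ndisks, parity):
    """KiB each disk writes for one full 128k record in an n-disk vdev.

    A 2-way mirror is ndisks=2, parity=1: each side writes the full record."""
    return RECORD_KIB / (ndisks - parity)

print(segment_kib(2, 1))             # 2-way mirror: 128.0 KiB per disk
print(segment_kib(3, 1))             # 3-disk raidz1: 64.0 KiB
print(segment_kib(5, 1))             # 5-disk raidz1: 32.0 KiB
print(round(segment_kib(15, 3), 1))  # 15-disk raidz3: 10.7 KiB

# 3 TiB of data in 128 KiB records: ~24 million records in one big raidz3,
# versus ~8 million per vdev when spread over three 5-disk raidz1's.
records_3t = 3 * 2**40 // (RECORD_KIB * 1024)
print(records_3t)       # 25165824 -- "at least 24 million"
print(records_3t // 3)  # 8388608 per raidz1 vdev
```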
If you use the smaller raidz1's, only 1T gets written to each vdev, so you need only 8 million records per vdev. Thus, to resilver the large vdev, you will require 3x more IO operations.

Worse still, on each IO request you have to wait for the slowest of all the disks to return. In a 2-way mirror, your seek time would be the average seek time of a single disk. But in an infinite-disk situation, your seek time would be the worst-case seek time on every single IO operation, which is about 2x the average seek time. So not only do you have 3x more seeks to perform, you wait up to 2x longer on each seek.

Now, to put some numbers on this: a single 1T disk can sustain (let's assume) 1.0 Gbit/sec sequential read/write. That means resilvering the entire disk sequentially, including unused space (which is not what ZFS does), would take 2.2 hours. In practice, on my 1T disks, which are in a mirrored configuration, I find resilvering takes 12 hours. I would expect ~4 days with a 5-disk raidz1, and ~12 days with a 15-disk raidz3.

Your disks are all 2T, so double all the times I just wrote. Your raidz3 should take approximately 24 days to resilver a single disk. A 5-disk raidz1 should do one in ~8 days. If you were using mirrors, ~1 day. Suddenly the prospect of multiple overlapping failures doesn't seem so unlikely.
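Those estimates follow from one scaling rule: take the observed 12-hour mirror resilver, multiply by the extra IO operations (128k divided by the per-disk segment size) and by the ~2x worst-case seek penalty, then by disk size. A rough sketch using the post's own numbers (the function is my illustration, not a ZFS formula):

```python
# Scale the measured 12 h mirror resilver (1 TB disk) by the two factors
# argued above: more IOs (smaller per-disk segments) and ~2x worse seeks.
# Purely illustrative arithmetic built on the post's assumptions.

MIRROR_HOURS_1TB = 12.0  # observed resilver time for a mirrored 1 TB disk

def est_resilver_days(segment_kib, disk_tb=1, seek_factor=2.0):
    io_factor = 128.0 / segment_kib  # vs. the mirror's 128 KiB per-disk writes
    return MIRROR_HOURS_1TB * io_factor * seek_factor * disk_tb / 24

print(est_resilver_days(32))           # 5-disk raidz1, 1 TB disks: 4.0 days
print(est_resilver_days(128 / 12))     # 15-disk raidz3, 1 TB disks: 12.0 days
print(est_resilver_days(128 / 12, 2))  # raidz3 with 2 TB disks: 24.0 days
print(est_resilver_days(32, 2))        # raidz1 with 2 TB disks: 8.0 days
```

With `seek_factor=1` and `io_factor=1` the same formula reproduces the mirror baseline (~1 day for a 2 TB disk), which is why the wide-vdev numbers diverge so sharply from it.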