On 5/4/2012 1:24 PM, Peter Tribble wrote:
On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey wrote:
I think you'll get better performance and reliability if you break each
of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:
Incorrect on reliability; see below.
Now, to put some numbers on this...
A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
sequential. This means resilvering the entire disk sequentially, including
unused space, (which is not what ZFS does) would require 2.2 hours. In
practice, on my 1T disks, which are in a mirrored configuration, I find
resilvering takes 12 hours. I would expect this to be ~4 days if I were
using 5-disk raidz1, and I would expect it to be ~12 days if I were using 15-disk raidz3.
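Edward's 2.2-hour figure is easy to sanity-check; a quick sketch, assuming the hypothetical 1 TB drive sustaining 1.0 Gbit/s from the paragraph above:

```python
# Back-of-envelope check: time to read/write an entire 1 TB drive
# sequentially at a sustained 1.0 Gbit/s (the assumption in the text).
disk_bytes = 1e12          # 1 TB
seq_bits_per_sec = 1.0e9   # 1.0 Gbit/s
seconds = disk_bytes * 8 / seq_bits_per_sec
print(f"{seconds / 3600:.1f} hours")  # -> 2.2 hours
```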
Based on your use of "I would expect", I'm guessing you haven't
done the actual measurement.
I see ~12-16 hour resilver times on pools using 1TB drives in
raidz configurations. The resilver times don't seem to vary
with whether I'm using raidz1 or raidz2.
Suddenly the prospect of multiple failures overlapping doesn't seem so unlikely.
Which is *exactly* why you need multiple-parity solutions. Put
simply, if you're using single-parity redundancy with 1TB drives
or larger (raidz1 or 2-way mirroring) then you're putting your
data at risk. I'm seeing - at a very low level, but clearly non-zero -
occasional read errors during rebuild of raidz1 vdevs, leading to
data loss. Usually just one file, so it's not too bad (and zfs will tell
you which file has been lost). And the error rates we're seeing, in
terms of uncorrectable (and undetectable) errors from drives, are
actually slightly better than you would expect from the manufacturers'
spec sheets.
So you definitely need raidz2 rather than raidz1; I'm looking at
going to raidz3 for solutions using current high-capacity (i.e. 3TB) drives.
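The rebuild risk described above can be roughly quantified. A sketch, assuming the commonly quoted consumer-drive spec of one unrecoverable read error per 1e14 bits read (an assumption; check your drive's datasheet — the 5-disk/1 TB figures are illustrative):

```python
import math

# Probability of hitting at least one unrecoverable read error while
# reading `bytes_read` bytes, given an error rate of `uber` per bit.
# Uses the Poisson approximation 1 - e^(-rate * bits).
def p_read_error(bytes_read, uber=1e-14):
    bits = bytes_read * 8
    return 1 - math.exp(-uber * bits)

# Rebuilding one failed disk in a 5-disk raidz1 of 1 TB drives means
# reading all 4 surviving disks (worst case, pool full):
print(f"{p_read_error(4e12):.0%}")  # -> 27%
```

With raidz2, a single such error during rebuild is still correctable, which is the point being made above.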
(On performance, I know what the theory says about getting one
disk's worth of IOPS out of each vdev in a raidz configuration. In
practice we're finding that our raidz systems actually perform
pretty well when compared with dynamic stripes, mirrors, and
hardware raid LUNs.)
Really, guys: Richard, myself, and several others have covered how ZFS
does resilvering (and on disk reliability, a related issue), and
included very detailed calculations on IOPS required and discussions
about slabs, recordsize, and how disks operate with regards to
seek/access times and OS caching.
Please search the archives, as it's not fruitful to repost the exact
same thing repeatedly.
Short version: assuming identical drives and the exact same usage
pattern and /amount/ of data, the time it takes the various ZFS
configurations to resilver is N for ANY mirrored config and a bit less
than N*M for an M-disk RAIDZ*, where M = the number of data disks in the
RAIDZ* - thus a 6-drive (total) RAIDZ2 will have the same resilver time
as a 5-drive (total) RAIDZ1. Calculating what N is depends entirely on
the pattern which the data was written on the drive. You're always
going to be IOPS-bound on the disk being resilvered.
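That rule of thumb can be sketched as follows (illustrative numbers only; `resilver_time` is a hypothetical helper, not a ZFS interface):

```python
# Sketch of the resilver-time rule above: any mirror takes N; an
# M-data-disk RAIDZ* takes a bit less than N*M (modelled here as the
# N*M upper bound). N itself depends on the data's on-disk pattern.
def resilver_time(n_hours, data_disks=1):
    return n_hours * data_disks

N = 12  # e.g. the 12-hour mirror resilver reported earlier in the thread
print(resilver_time(N))     # mirror: 12 hours
print(resilver_time(N, 4))  # 5-drive RAIDZ1 has 4 data disks
print(resilver_time(N, 4))  # 6-drive RAIDZ2 also has 4 data disks -> same
```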
Which RAIDZ* config to use (assuming you have a fixed tolerance for data
loss) depends entirely on what your data usage pattern does to resilver
times; configurations needing very long resilver times better have more
redundancy. And, remember, larger configs will allow for more data to be
stored, which also increases resilver time.
Oh, and a RAIDZ* will /only/ ever get you slightly more than 1 disk's
worth of IOPS (averaged over a reasonable time period). Caching may
make it appear to give more IOPS in certain cases, but that's neither
sustainable nor predictable, and the backing store is still only giving
1 disk's IOPS. The RAIDZ* may, however, give you significantly more
throughput (in MB/s) than a single disk if you do a lot of sequential
read or write.
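A minimal model of that claim, with illustrative per-disk figures (100 random IOPS, 120 MB/s sequential; assumptions, not measurements):

```python
# Rough model of a RAIDZ* vdev's steady-state performance: about one
# disk's worth of random IOPS, but sequential throughput that scales
# with the number of data disks.
def raidz_vdev(disks, parity, disk_iops=100, disk_mbps=120):
    data_disks = disks - parity
    return {
        "random_iops": disk_iops,            # ~1 disk's IOPS per vdev
        "seq_mbps": data_disks * disk_mbps,  # scales with data disks
    }

print(raidz_vdev(6, 2))  # -> {'random_iops': 100, 'seq_mbps': 480}
```

This is why pools built for random I/O are usually many small vdevs (often mirrors) rather than one wide RAIDZ*.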
zfs-discuss mailing list