On Mar 21, 2012, at 10:40 AM, Marion Hakanson wrote:
> p...@kraus-haus.org said:
>> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
>> Achieving 500MB/sec. with 8KB files and lots of random accesses is really
>> hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of
>> 100MB+ files is much easier.
>> . . .
>> For ZFS, performance is proportional to the number of vdevs NOT the
>> number of drives or the number of drives per vdev. See https://
>> Xc for some testing I did a while back. I did not test sequential read as
>> that is not part of our workload.
Actually, few people have sequential workloads. Many think they do, but I say
prove it with iopattern.
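(iopattern itself is a DTrace script; the idea behind it can be sketched in a few lines of Python against a hypothetical trace of (offset, size) pairs, counting an op as sequential only if it starts where the previous op ended:)

```python
def classify(ops):
    """Classify I/O ops as sequential or random, iopattern-style.

    ops: list of (offset, size) tuples from an I/O trace.
    An op counts as sequential if it begins where the previous op ended.
    """
    seq = rand = 0
    prev_end = None
    for offset, size in ops:
        if prev_end is not None and offset == prev_end:
            seq += 1
        else:
            rand += 1
        prev_end = offset + size
    total = seq + rand
    return {"sequential_pct": 100 * seq // total,
            "random_pct": 100 * rand // total}

# Hypothetical trace: three back-to-back 8 KB reads, then a far seek.
trace = [(0, 8192), (8192, 8192), (16384, 8192), (10_000_000, 8192)]
print(classify(trace))  # {'sequential_pct': 50, 'random_pct': 50}
```

Run that over a real trace and most "sequential" workloads turn out to be a lot more random than their owners believe.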
>> . . .
>> I understand why the read performance scales with the number of vdevs,
>> but I have never really understood _why_ it does not also scale with the
number of drives in each vdev. When I did my testing with 40 drives, I
>> expected similar READ performance regardless of the layout, but that was NOT
>> the case.
> In your first paragraph you make the important point that "performance"
> is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above,
> you go back to using "performance" in its ambiguous form. I assume that
> by "performance" you are mostly focussing on random-read performance....
> My experience is that sequential read performance _does_ scale with the number
> of drives in each vdev. Both sequential and random write performance also
> scales in this manner (note that ZFS tends to save up small, random writes
> and flush them out in a sequential batch).
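That parenthetical is worth dwelling on; a toy model of the behavior (assumed numbers, and only a rough stand-in for what a ZFS transaction group actually does):

```python
# Toy model: small random writes accumulate in memory, then flush
# as one batch sorted by offset, so the disk sees mostly-ascending
# writes instead of seeks. Roughly the effect of a ZFS txg flush.
import random

random.seed(1)
# 100 pending 8 KB writes at random 4 KB-aligned offsets.
pending = [(random.randrange(0, 10**9, 4096), 8192) for _ in range(100)]

batch = sorted(pending)  # flush in offset order
ascending = all(a[0] <= b[0] for a, b in zip(batch, batch[1:]))
print("flushed", len(batch), "writes, ascending offsets:", ascending)
```

This is why random *write* performance on raidz is far less painful than random *read* performance: the writes get turned into something close to a streaming workload before they hit the disks.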
I wrote a small, random-read performance model that considers the various
raidz configurations. It is described here:
The spreadsheet shown in figure 3 is available for the asking (and it works on
an iPhone or iPad :-)
> Small, random read performance does not scale with the number of drives in
> raidz vdev because of the dynamic striping. In order to read a single
> logical block, ZFS has to read all the segments of that logical block, which
> have been spread out across multiple drives, in order to validate the checksum
> before returning that logical block to the application. This is why a single
> vdev's random-read performance is equivalent to the random-read performance of
> a single drive.
It is not as bad as that. The actual worst-case number for a HDD with a queue
depth of one is:
	average IOPS * ((D+P) / D)
where
	D = number of data disks
	P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
	total disks per set = D + P
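Plugging assumed numbers into that formula makes the point concrete (the 100 IOPS/drive figure is a nominal 7200-rpm value, not from the thread):

```python
def worst_case_iops(avg_disk_iops, d, p):
    """Worst-case small-random-read IOPS for one raidz vdev.

    avg_disk_iops: random-read IOPS of a single drive (assumed 100 here)
    d: data disks; p: parity disks (1/2/3 for raidz/raidz2/raidz3)
    """
    return avg_disk_iops * (d + p) / d

# An 8+2 raidz2 vdev: barely better than a single drive.
print(worst_case_iops(100, 8, 2))      # 125.0
# Pool-wide random-read IOPS then scales with the number of vdevs:
print(4 * worst_case_iops(100, 8, 2))  # 500.0 for four such vdevs
```

Note the vdev-level result sits between 1x and 2x a single drive regardless of width, which is why adding disks to a vdev buys so little for this workload, while adding vdevs scales it linearly.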
We did many studies that verified this. More recent studies show that queuing
has a huge impact on the average latency of HDDs, which I also described in my
talk at the OpenStorage Summit last fall.
> p...@kraus-haus.org said:
>> The recommendation is to not go over 8 or so drives per vdev, but that is
>> a performance issue NOT a reliability one. I have also not been able to
>> duplicate others observations that 2^N drives per vdev is a magic number (4,
>> 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is
>> reliable, just (relatively) slow :-)
Paul, I have a considerable amount of data that refutes your findings. Can we
agree that YMMV, and dramatically so, depending on your workload?
> Again, the "performance issue" you describe above is for the random-read
> case, not sequential. If you rarely experience small-random-read workloads,
> then raidz* will perform just fine. We often see 2000 MBytes/sec sequential
> read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdev's
> (using 2TB drives).
Yes, this is relatively easy to see. I've seen 6 GBytes/sec for large configs,
but that begins to push the system boundaries in many ways.
> However, when a disk fails and must be resilvered, that's when you will
> run into the slow performance of the small, random read workload. This
> is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives,
> especially of the 1TB+ size. That way if it takes 200 hours to resilver,
> you've still got a lot of redundancy in place.
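A back-of-envelope sketch of why resilver can stretch to hundreds of hours (all parameters assumed, not measured; a fragmented pool resilvers at the random-read rate, not the streaming rate):

```python
# Assumed parameters for a 2 TB drive in a fragmented pool.
capacity_bytes = 2 * 10**12
avg_record = 32 * 1024       # assumed average record size read back
random_iops = 100            # one drive's random-read rate
seq_rate = 100 * 10**6       # 100 MB/s streaming rate, for contrast

random_hours = capacity_bytes / avg_record / random_iops / 3600
seq_hours = capacity_bytes / seq_rate / 3600
print(f"random-limited: {random_hours:.0f} h, sequential: {seq_hours:.1f} h")
```

With these assumptions the random-limited case lands in the same ballpark as the 200-hour figure above, versus a few hours if the resilver could stream, which is exactly why the extra parity of raidz2/raidz3 matters on wide vdevs of big drives.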
> zfs-discuss mailing list
DTrace Conference, April 3, 2012,
ZFS Performance and Training