Bob Friesenhahn wrote:
> Your idea to stripe two disks per LUN should work.  Make sure to use
> raidz2 rather than plain raidz for the extra reliability.  This
> solution is optimized for high data throughput from one user.

Striping two disks per LUN (a two-disk RAID0) and then adding a ZFS form of 
redundancy on top (either mirror or raidz[2]) would be an efficient use of space.  
The hardware striping adds no space overhead beyond what the ZFS redundancy 
itself costs.

Note, however, that if you do this, ZFS must resilver the entire two-disk LUN in 
the event of a single disk failure on the back end.  This means a longer rebuild, 
plus a lot of "extra" I/O against the surviving (non-failed) half of the RAID0 
stripe.

> 
> An alternative is to create individual "RAID 0" LUNs which actually
> only contain a single disk.  

This is certainly preferable, since the unit of failure at the hardware level 
corresponds to the unit of resilvering at the ZFS level.  And at least on my 
Nexsan SATAboy(2f) this configuration is possible.

> Then implement the pool as two raidz2s
> with six LUNs each, and two hot spares.  That would be my own
> preference.  Due to ZFS's load share this should provide better
> performance (perhaps 2X) for multi-user loads.  Some testing may be
> required to make sure that your hardware is happy with this.

I disagree with this suggestion.  With this config, you only get 8 disks' worth 
of storage out of the 14 (two 6-disk raidz2 groups each give 4 data disks), which 
is a ~43% overhead.  In order to lose data in this scenario, 3 disks would have 
to fail out of a single 6-disk group before ZFS is able to resilver any of them 
to the hot spares.  That seems (to me) a lot more redundancy than is needed.
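
For reference, the suggested layout would be roughly this, built from 
single-disk LUNs (pool and device names are again made up):

   # zpool create tank \
         raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0  c2t4d0  c2t5d0 \
         raidz2 c2t6d0 c2t7d0 c2t8d0 c2t9d0  c2t10d0 c2t11d0 \
         spare  c2t12d0 c2t13d0

Each 6-disk raidz2 group yields 4 disks of usable space, which is where the 
8-out-of-14 figure comes from.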

As far as workload goes, any time you use raidz[2], ZFS must read the entire 
stripe (across all of the disks) in order to verify the checksum for that data 
block.  This means that a 128k read (the default ZFS recordsize) turns into a 
32k read from each of the 6 disks in the group, each of which may include a 
relatively slow seek to the relevant part of the spinning rust.  So for random 
I/O, even though the data is striped across all the disks, you will see only a 
single disk's worth of throughput.  For sequential I/O, you'll see the full 
RAID set's worth of throughput.

If you are expecting a non-sequential workload, you would be better off taking 
the 50% storage overhead to do ZFS mirroring.
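
With the same 14 single-disk LUNs, that would be something like six 2-way 
mirrors plus the two spares (hypothetical names once more):

   # zpool create tank \
         mirror c2t0d0  c2t1d0   mirror c2t2d0  c2t3d0 \
         mirror c2t4d0  c2t5d0   mirror c2t6d0  c2t7d0 \
         mirror c2t8d0  c2t9d0   mirror c2t10d0 c2t11d0 \
         spare  c2t12d0 c2t13d0

You give up 2 more disks of capacity versus the two-raidz2 layout, but each 
mirror can service small random reads independently, so random I/O scales 
much better.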

> 
> Avoid RAID5 if you can because it is not as reliable with today's
> large disks and the resulting huge LUN size can take a long time to
> resilver if the RAID5 should fail (or be considered to have failed).

Here's a place where ZFS shines: it resilvers only the allocated data blocks, 
not the whole disk.  Since it doesn't have to read the full array to rebuild a 
failed disk, it's less likely to cause a subsequent failure during the rebuild.
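
When a disk (or LUN) does fail, the rebuild is just a replace, and zpool status 
shows the resilver walking only the allocated blocks (device name hypothetical):

   # zpool replace tank c2t3d0
   # zpool status tank

If hot spares are configured, one can also be swapped in by hand the same way 
until the failed device is physically replaced.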

My $.02.

--Joe