On 8/10/2010 9:57 PM, Peter Taps wrote:
Hi Eric,

Thank you for your help. At least one part is clear now.

I still am confused about how the system is still functional after one disk 
fails.

Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it 
simple let's not consider block sizes.

Let's say I send a write value "abcdef" to the zpool.

As the data gets striped, we will have 2 characters per disk.

disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info

Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info 
may tell me that something is bad but I don't see how my data will get recovered.

The only good thing is that any newer data will now be striped over two disks.

Perhaps I am missing some fundamental concept about raidz.

Regards,
Peter

Parity is not intended to tell you *if* something is bad (well, it's not *designed* for that). It tells you how to RECONSTRUCT something should it be bad. ZFS uses checksums of the data (which are themselves stored as data) to tell whether some data is bad and thus needs to be rewritten, which is something virtually no other filesystem does now. Parity is used at a lower level to reconstruct data after a device failure. It is not directly used to determine whether a device (or block of data) is bad.
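To illustrate only the detection side of that split, here is a minimal Python sketch. It assumes SHA-256 from the standard library as a stand-in checksum, and the `checksum` helper is a hypothetical name; none of this is ZFS code. The point is just that a checksum kept apart from the block can say "this block is bad", but cannot say what the block should have been.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # Stand-in checksum for illustration; not a claim about ZFS internals.
    return hashlib.sha256(block).digest()

stored = b"cd"
expected = checksum(stored)   # kept separately from the block itself

damaged = b"cx"               # the same block after silent corruption

good_ok = checksum(stored) == expected    # block verifies
bad_ok = checksum(damaged) == expected    # mismatch: block is known bad

print(good_ok, bad_ok)
```

Knowing the block is bad is where the checksum's job ends; getting the original bytes back is the job of parity.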


To simplify, let's assume we're talking about raidz1 (the principles generally apply to raidz2 and raidz3, but the details differ slightly).


Parity is constructed using mathematical XOR, which has the following property:

if A XOR B = C
then
    A XOR C = B    and also    B XOR C = A

(XOR is also fully commutative, so A XOR B = B XOR A )
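A quick way to convince yourself of these identities is to try them on real byte values. This is a small Python sketch; the choice of the characters "a" and "b" as the values of A and B is only an illustration.

```python
# Byte values standing in for A and B (illustrative choice).
a = ord("a")   # 0x61
b = ord("b")   # 0x62

c = a ^ b      # A XOR B = C, the "parity" of the two values

assert a ^ c == b       # A XOR C = B
assert b ^ c == a       # B XOR C = A
assert a ^ b == b ^ a   # XOR is commutative

print(hex(c))  # prints 0x3
```

Python's `^` operator is bitwise XOR on integers, so the same check works for any pair of byte values.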


So, in your case, what we have is some data, "abcdef", and three disks. Assuming we have a stripe set up so that 1 BYTE (i.e. one character) gets stored on each device, each stripe holds two data bytes plus one parity byte, and what you have is this:

Stripe       Device 1     Device 2     Device 3
1            A            B            A XOR B
2            C XOR D      C            D
3            E            E XOR F      F


(where X XOR Y means the binary value computed by XOR-ing X with Y)

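In any case, if I lose one of the devices above, I simply XOR the corresponding values from the other two devices to reconstruct what I need. For example, if Device 2 fails, then in stripe 1, B = A XOR (A XOR B); in stripe 2, C = (C XOR D) XOR D; and in stripe 3 the lost parity is simply recomputed as E XOR F.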
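The table and the recovery step can be sketched in a few lines of Python. This models only the simplified picture used in this message (three devices, one byte per device, parity rotating as in the table); it is not how ZFS actually lays out blocks on disk. The helper names `layout` and `rebuild` are hypothetical.

```python
def layout(data: bytes) -> list[list[int]]:
    """Build the stripes from the table: two data bytes plus one parity byte.

    Assumes len(data) is even and exactly three devices.
    """
    stripes = []
    for i in range(0, len(data), 2):
        x, y = data[i], data[i + 1]
        row = [x, y]
        # Parity lands on Device 3, then Device 1, then Device 2, as in the table.
        parity_dev = (2 + i // 2) % 3
        row.insert(parity_dev, x ^ y)
        stripes.append(row)
    return stripes

def rebuild(stripes: list[list[int]], lost: int) -> list[int]:
    """Recompute the column of the lost device from the two survivors."""
    return [row[(lost + 1) % 3] ^ row[(lost + 2) % 3] for row in stripes]

stripes = layout(b"abcdef")

lost = 1  # pretend Device 2 (index 1) has failed
recovered = rebuild(stripes, lost)

# The rebuilt column matches what Device 2 originally held.
assert recovered == [row[lost] for row in stripes]
```

Note that `rebuild` does not care whether the lost value was data or parity: in every stripe, the missing value is the XOR of the other two.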



For raidz2 and raidz3, there are two or three parity calculations respectively (it's not a straight XOR; I forget the algorithm), but the process is the same: you use the data from the remaining devices to recompute the lost device or devices. Because the parity block for a stripe is stored in a balanced manner across all devices (there is no dedicated parity-only device), it becomes simpler to recover data while retaining performance.



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
