On 8/10/2010 9:57 PM, Peter Taps wrote:
Hi Eric,
Thank you for your help. At least one part is clear now.
I still am confused about how the system is still functional after one disk
fails.
Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it
simple let's not consider block sizes.
Let's say I send a write value "abcdef" to the zpool.
As the data gets striped, we will have 2 characters per disk.
disk1 = "ab" + some parity info
disk2 = "cd" + some parity info
disk3 = "ef" + some parity info
Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info
may tell me that something is bad but I don't see how my data will get recovered.
The only good thing is that any newer data will now be striped over two disks.
Perhaps I am missing some fundamental concept about raidz.
Regards,
Peter
Parity is not intended to tell you *if* something is bad (well, it's not
*designed* for that). It tells you how to RECONSTRUCT something should
it be bad. ZFS uses checksums of the data (which are themselves stored
as data) to tell whether some data is bad and thus needs to be rewritten;
virtually no other filesystem does this today. Parity is used
at a lower level to reconstruct data on devices after a device failure.
It is not directly used to determine whether a device (or block of data) is bad.
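To make the split between "detect" and "reconstruct" concrete, here is a minimal sketch in Python. SHA-256 is only a stand-in chosen for illustration, and the names (block, stored_checksum, is_intact) are made up for this example; it is not a description of ZFS's actual code path.

```python
import hashlib

# The data as originally written, plus a checksum stored separately from it.
block = b"abcdef"
stored_checksum = hashlib.sha256(block).digest()

def is_intact(data: bytes, checksum: bytes) -> bool:
    """Return True if the data still matches the checksum recorded at write time."""
    return hashlib.sha256(data).digest() == checksum

# Simulate silent corruption: flip one bit in the third byte.
damaged = bytearray(block)
damaged[2] ^= 0x01
corrupted = bytes(damaged)

print(is_intact(block, stored_checksum))      # the good copy verifies
print(is_intact(corrupted, stored_checksum))  # the damaged copy does not
```

Note that the checksum only says that the data is wrong; it carries no information about what the right bytes were. That is the job of parity.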
To simplify, let's assume we're talking about raidz1 (the principles
generally apply to raidz2 and raidz3, but the details differ slightly).
Parity is constructed using mathematical XOR, which has the following
property:

    if    A XOR B = C
    then  A XOR C = B   and also   B XOR C = A

(XOR is also commutative, so A XOR B = B XOR A)
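If you want to convince yourself of this property, a quick check in Python (where ^ is the bitwise XOR operator) looks roughly like this. The byte values are just the ASCII codes of "a" and "b" from your example.

```python
# XOR two byte values and confirm that either input can be recovered
# from the result plus the other input.
a = ord("a")   # 0x61
b = ord("b")   # 0x62
c = a ^ b      # the "parity" of a and b

assert a ^ c == b      # A XOR C = B
assert b ^ c == a      # B XOR C = A
assert a ^ b == b ^ a  # order does not matter

print(hex(c))
```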
So, in your case, what we have is some data "abcdef" and three disks.
So, assuming we have a stripe set up so that 1 BYTE (i.e. character)
gets stored on each device, then what you have is this:
Stripe    Device 1    Device 2    Device 3
  1       A           B           A XOR B
  2       C XOR D     C           D
  3       E           E XOR F     F
(where X XOR Y means the binary value computed by XOR-ing X with Y)
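In any case, if I lose one of the devices above, I simply XOR the
corresponding values from the other two devices to reconstruct what I need.
For example, if Device 2 fails, stripe 1 gives B = A XOR (A XOR B),
stripe 2 gives C = (C XOR D) XOR D, and in stripe 3 only parity was lost,
so E XOR F is simply recomputed from E and F.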
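The table and the recovery step can be sketched in a few lines of Python. This is a toy model of the table above (one byte per device, parity position rotating per stripe), not ZFS's actual on-disk layout, and the function names build_stripes, reconstruct and read_back are invented for the example.

```python
def build_stripes(data: bytes, ndev: int = 3):
    """Split data into stripes of ndev-1 data bytes plus one XOR parity byte.

    Returns a list of (row, parity_pos) pairs, where row[i] is the byte held
    by device i and parity_pos says which device holds parity for that stripe.
    """
    per_stripe = ndev - 1
    stripes = []
    for s, i in enumerate(range(0, len(data), per_stripe)):
        chunk = data[i:i + per_stripe].ljust(per_stripe, b"\0")
        parity = 0
        for byte in chunk:
            parity ^= byte
        # Rotate parity: last device for stripe 1, then device 1, device 2, ...
        parity_pos = (ndev - 1 + s) % ndev
        row = list(chunk)
        row.insert(parity_pos, parity)
        stripes.append((row, parity_pos))
    return stripes

def reconstruct(row, lost: int) -> int:
    """Recompute the byte on the lost device by XOR-ing all surviving bytes."""
    value = 0
    for dev, byte in enumerate(row):
        if dev != lost:
            value ^= byte
    return value

def read_back(stripes, lost: int) -> bytes:
    """Read the original data with device `lost` unavailable."""
    out = bytearray()
    for row, parity_pos in stripes:
        repaired = list(row)
        repaired[lost] = reconstruct(row, lost)  # never reads row[lost]
        out.extend(b for dev, b in enumerate(repaired) if dev != parity_pos)
    return bytes(out)

stripes = build_stripes(b"abcdef")
print(read_back(stripes, lost=1))  # Device 2 (index 1) has failed
```

The same call works whichever single device is missing, because every stripe always has two of its three bytes left, and any two determine the third.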
For raidz2 and raidz3, there are two or three parity calculations (it's not a
straight XOR, I forget the algorithm), but the process is the same - you
use the data from the remaining devices to recompute the lost device or
devices. As the parity block for a stripe is stored in a balanced manner
across all devices (there is no dedicated parity-only device), it
becomes simpler to recover data while retaining performance.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss