Erik Trimble wrote:
> On 8/10/2010 9:57 PM, Peter Taps wrote:
> > Hi Eric,
> >
> > Thank you for your help. At least one part is clear now.
> >
> > I still am confused about how the system is still functional after one disk fails.
> >
> > Consider my earlier example of 3 disks zpool configured for raidz-1. To keep it simple let's not consider block sizes.
> >
> > Let's say I send a write value "abcdef" to the zpool.
> >
> > As the data gets striped, we will have 2 characters per disk.
> >
> > disk1 = "ab" + some parity info
> > disk2 = "cd" + some parity info
> > disk3 = "ef" + some parity info
> >
> > Now, if disk2 fails, I lost "cd." How will I ever recover this? The parity info may tell me that something is bad but I don't see how my data will get recovered.
> >
> > The only good thing is that any newer data will now be striped over two disks.
> >
> > Perhaps I am missing some fundamental concept about raidz.
> >
> > Regards,
> > Peter
>
> Parity is not intended to tell you *if* something is bad (well, it's not *designed* for that). It tells you how to RECONSTRUCT something should it be bad. ZFS uses Checksums of the data (which are stored as data themselves) to tell if some data is bad, and thus needs to be re-written

To follow up Erik's post: in general, parity can be used both to detect and to correct errors in a string of equal-sized numbers, and each parity is the same size as each of the numbers. In the old serial protocols, one bit was used to detect an error in a string of 7 bits, so each "number" in the string was a single bit. In the case of ZFS, each "number" in the string is a disk block. The length of the string of numbers is completely arbitrary.

I am rusty on parity math, but Reed-Solomon is used (of which XOR is a degenerate case) such that each parity is independent of the other parities. RAIDZ can support up to three parities per stripe.

Generally, a single parity can either detect a single corrupt number in a string or, if it is known which number is corrupt, correct that number. Traditional RAID5 assumes that it knows which number (i.e. block) is bad because the disk failed, and therefore it can use the parity block to reconstruct it. RAID5 cannot repair a random bit-flip, because nothing tells it which block the flip is in.

RAIDZ takes a different approach, where the checksum for the number string (i.e. stripe) exists in a different, already validated stripe. With that checksum in hand, ZFS knows when a stripe is corrupt but not which block. ZFS will then reconstruct each data block in the stripe using the parity block, one data block at a time, until the checksum matches. At that point ZFS knows which block is bad and can rebuild it and write it to disk. A scrub does this for all stripes and all parities in each stripe.

Using the example above, the disk layout would look more like the following for a single stripe, and as Erik mentioned, the location of the data and parity blocks will change from stripe to stripe:

disk1 = "ab"
disk2 = "cd"
disk3 = parity info

Again using the example above, if disk 2 fails, or even stays online but produces bad data, its contents can be reconstructed from disk 1 and the parity on disk 3.

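To make the reconstruction step concrete, here is a minimal Python sketch of the idea. It assumes single parity computed with plain XOR and uses SHA-256 as a stand-in for the stripe checksum. The function names are made up for illustration, and this is not how the actual ZFS code is structured.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-sized byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def checksum(data_blocks):
    """Stand-in stripe checksum (SHA-256 is an assumption for illustration)."""
    return hashlib.sha256(b"".join(data_blocks)).digest()


def rebuild_missing(data_blocks, parity, missing):
    """Known failure (the RAID5 case): rebuild block `missing` from the rest."""
    survivors = [b for i, b in enumerate(data_blocks) if i != missing]
    return xor_blocks(survivors + [parity])


def find_and_repair(data_blocks, parity, expected):
    """Silent corruption: rebuild one data block at a time until the checksum matches.

    Returns (index_of_bad_block, repaired_blocks). The index is None when the
    data already matches the checksum. Raises ValueError when no single-block
    repair produces a matching checksum.
    """
    if checksum(data_blocks) == expected:
        return None, list(data_blocks)
    for i in range(len(data_blocks)):
        candidate = list(data_blocks)
        candidate[i] = rebuild_missing(data_blocks, parity, i)
        if checksum(candidate) == expected:
            return i, candidate
    raise ValueError("more damage than a single parity can repair")


# One stripe from the example: disk1 = "ab", disk2 = "cd", disk3 = parity.
d1, d2 = b"ab", b"cd"
parity = xor_blocks([d1, d2])
good_sum = checksum([d1, d2])  # in ZFS this lives outside the stripe itself

# Case 1: disk2 is dead, so we already know which block to rebuild.
print(rebuild_missing([d1, None], parity, 1))

# Case 2: disk2 is online but returns garbage; the checksum finds the culprit.
print(find_and_repair([d1, b"cX"], parity, good_sum))
```

The first case only needs parity, because the failed disk tells us which block is missing. The second case needs the checksum as well, since parity alone cannot say which of the blocks is the wrong one.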
The beauty of ZFS is that it does not depend on parity to detect errors. Your stripes can be as wide as you want (up to 100-ish devices), and you can choose 1, 2 or 3 parity devices.

Hope that makes sense,
Marty
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss