From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Adam Leventhal
Sent: Thursday, September 03, 2009 2:08 AM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123

Hey folks,

There are two problems with RAID-Z in builds snv_120 through snv_123 that
will both be resolved in build snv_124. The problems are as follows:

1. Data corruption on a RAID-Z system of any sort (raidz1, raidz2, raidz3)
can lead to spurious checksum errors being reported on devices that were
not used as part of the reconstruction.

These errors are harmless and can be cleared safely (zpool clear <pool>).


2. There is a far more serious problem with single-parity RAID-Z that can
lead to data corruption. This data corruption is recoverable as long as no
additional data corruption or drive failure occurs. That is to say, data
is fine provided there is not an additional problem. The problem is present
on all raidz1 configurations that use an odd number of children (disks),
e.g. 4+1 or 6+1. Note that raidz1 configurations with an even number of
children (e.g. 3+1), raidz2, and raidz3 are unaffected.

The recommended course of action is to roll back to build snv_119 or
earlier. If for some reason this is impossible, please email me PRIVATELY,
and we can discuss the best course of action for you. After rolling back,
initiate a scrub. ZFS will identify and correct these errors, but if enough
accumulate it will (incorrectly) identify drives as faulty (which they
likely aren't). You can clear these failures (zpool clear <pool>).

Without rolling back, repeated scrubs will eventually remove all traces of
the data corruption. You may need to clear checksum failures as they're
identified to ensure that enough drives remain online.


For reference here's the bug:
   6869090 filebench on thumper with ZFS (snv_120) raidz causes checksum errors from all drives

Apologies for the bug and for any inconvenience this caused.

Below is a technical description of the two issues. This is for interest
only and does not contain additional discussion of symptoms or prescriptive
action.

Adam

---8<---

1. In situations where a block read from a RAID-Z vdev fails to checksum
but there were no errors from any of the child vdevs (e.g. hard drives), we
must enter combinatorial reconstruction, in which we attempt every
combination of data and parity until we find the correct data. The logic
was modified to scale to triple-parity RAID-Z, and in doing so I introduced
a bug in which spurious error reports may, in some circumstances, be
generated for vdevs that were not used as part of the data reconstruction.
These do not represent actual corruption or problems with the underlying
devices and can be ignored and cleared.
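
For illustration only, here is a rough Python sketch of what combinatorial
reconstruction amounts to (this is not the actual vdev_raidz.c logic; the
'reconstruct' and 'checksum_ok' callbacks are placeholders of mine): guess
which columns are bad, rebuild them from the rest, and accept the first
guess whose block checksum verifies.

    from itertools import combinations

    def combinatorial_reconstruct(columns, nparity, reconstruct, checksum_ok):
        # 'columns' are the data/parity buffers read from the child vdevs.
        # 'reconstruct' rebuilds the chosen columns from the remaining ones;
        # 'checksum_ok' verifies the candidate block against its checksum.
        for nbad in range(1, nparity + 1):
            for bad in combinations(range(len(columns)), nbad):
                candidate = reconstruct(columns, bad)
                if checksum_ok(candidate):
                    # Only the columns in 'bad' were damaged; the snv_120 bug
                    # reported spurious errors against other columns as well.
                    return candidate, bad
        return None, ()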


2. This one is far subtler and requires an understanding of how RAID-Z
writes work. For that I strongly recommend the following blog post from
Jeff Bonwick:

   http://blogs.sun.com/bonwick/entry/raid_z

Basically, RAID-Z writes full stripes every time; note that without careful
accounting it would be possible to effectively fragment the vdev such that
single sectors were free but useless, since single-parity RAID-Z requires
two adjacent sectors to store data (one for data, one for parity). To
address this, RAID-Z rounds up its allocation to the next multiple of
(nparity + 1) sectors. This ensures that all space is accounted for. RAID-Z
will thus skip sectors that are unused based on this rounding. For example,
under raidz1 a write of 1024 bytes would result in 512 bytes of parity,
512 bytes of data on each of two devices, and one 512-byte sector skipped.
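
The allocation arithmetic is easy to sketch; assuming 512-byte sectors
(the function and variable names below are mine, not ZFS's):

    import math

    SECTOR = 512

    def raidz_allocation(size, nparity, nchildren):
        # Sectors of data, parity, and skipped padding for one RAID-Z write
        # of 'size' bytes, following the rounding rule described above.
        data = size // SECTOR
        rows = math.ceil(data / (nchildren - nparity))   # stripe rows needed
        parity = nparity * rows                          # parity sectors
        total = data + parity
        rounded = math.ceil(total / (nparity + 1)) * (nparity + 1)
        return data, parity, rounded - total             # skipped sectors

    # The 1024-byte raidz1 example above on a 3-disk vdev:
    # 2 data sectors + 1 parity sector, rounded up to 4, so 1 skipped sector.
    print(raidz_allocation(1024, nparity=1, nchildren=3))    # (2, 1, 1)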

To improve performance, ZFS aggregates multiple adjacent IOs into a single
large IO. Further, hard drives themselves can perform aggregation of
adjacent IOs. We noted that these skipped sectors were inhibiting
performance, so we added "optional" IOs that could be used to improve
aggregation. This yielded a significant performance boost for all RAID-Z
configurations.
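
As a toy illustration of the idea (this is a sketch of mine, not the actual
vdev queue code), an optional zero-write covering a skipped sector is worth
issuing only when it lets two real, adjacent writes merge into one:

    def aggregate_writes(ios):
        # 'ios' is a list of (offset, size, optional) tuples sorted by offset.
        # Adjacent IOs merge into one larger IO; an optional zero-write over a
        # skipped sector is kept only if it glues real IOs together.
        merged, run = [], []

        def flush():
            while run and run[-1][2]:      # drop trailing optional IOs
                run.pop()
            if run:
                start = run[0][0]
                end = run[-1][0] + run[-1][1]
                merged.append((start, end - start))
            run.clear()

        for off, size, opt in ios:
            if run and off == run[-1][0] + run[-1][1]:
                run.append((off, size, opt))
            else:
                flush()
                if not opt:
                    run.append((off, size, opt))
        flush()
        return merged

    # A skipped sector between two real writes allows a single 1536-byte IO:
    print(aggregate_writes([(0, 512, False), (512, 512, True), (1024, 512, False)]))
    # [(0, 1536)]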

Another nuance of single-parity RAID-Z is that while it normally lays down
stripes as P D D (parity, data, data, ...), it will switch every megabyte
to move the parity into the second position (data, parity, data, ...).
This was ostensibly to effect the same improvement as between RAID-4 and
RAID-5 -- distributed parity. However, getting RAID-5's benefit actually
requires full distribution of parity, and RAID-Z already distributes parity
by virtue of the skipped sectors and variable-width stripes. In other
words, this was not a particularly valid optimization. It was accordingly
discarded for double- and triple-parity RAID-Z; they contain no such
swapping.
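
Roughly, the swap can be pictured like this (a sketch of mine; testing
bit 20 of the device offset is my shorthand for "every megabyte" here, not
a quote of the ZFS source):

    def maybe_swap_parity(columns, nparity, offset):
        # Single-parity RAID-Z swaps the parity column and the first data
        # column every other megabyte, so P D D ... becomes D P D ...
        if nparity == 1 and (offset & (1 << 20)):
            columns[0], columns[1] = columns[1], columns[0]
        return columns

    print(maybe_swap_parity(["P", "D0", "D1"], 1, offset=0))        # ['P', 'D0', 'D1']
    print(maybe_swap_parity(["P", "D0", "D1"], 1, offset=1 << 20))  # ['D0', 'P', 'D1']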

The implementation of this swapping was not taken into account for the
optional IOs, so rather than writing the optional IO into the skipped
sector, the optional IO overwrote the first sector of the subsequent stripe
with zeros.

The aggregation does not always happen, so the corruption is usually not
pervasive. Further, raidz1 vdevs with odd numbers of children are more
likely to encounter the problem. Let's say we have a raidz1 vdev with three
children. Two writes of 1K each would look like this:

            disks
          0   1   2
        _____________
        |   |   |   |                P = parity
        | P | D | D |  LBAs          D = data
        |___|___|___|   |            X = skipped sector
        |   |   |   |   |
        | X | P | D |   v
        |___|___|___|
        |   |   |   |
        | D | X |   |
        |___|___|___|

In this case, the logic for the optional IOs would effectively (though not
literally) fill in the next LBA on the disk with a 0:

        _____________
        |   |   |   |                P = parity
        | P | D | D |  LBAs          D = data
        |___|___|___|   |            X = skipped sector
        |   |   |   |   |            0 = zero-data from aggregation
        | 0 | P | D |   v
        |___|___|___|
        |   |   |   |
        | D | X |   |
        |___|___|___|

We can see the problem when the parity undergoes the swap described above:

            disks
          0   1   2
        _____________
        |   |   |   |                P = parity
        | D | P | D |  LBAs          D = data
        |___|___|___|   |            X = skipped sector
        |   |   |   |   |            0 = zero-data from aggregation
        | X | 0 | P |   v
        |___|___|___|
        |   |   |   |
        | D | X |   |
        |___|___|___|

Note that the 0 is also incorrectly swapped, thus inadvertently overwriting
a data sector in the subsequent stripe. This only occurs if there is IO
aggregation, making it much more likely with small, synchronous IOs. It's
also only possible with an odd number (N) of child vdevs, since to induce
the problem the size of the data written must consume a multiple of N-1
sectors _and_ the total number of sectors used for data and parity must be
odd (to create the need for a skipped sector).

The number of data sectors is simply size / 512, and the number of parity
sectors is ceil(size / 512 / (N-1)). The two conditions are therefore:

   1) size / 512 = K * (N-1)  for some integer K
   2) size / 512 + ceil(size / 512 / (N-1)) is odd

Substituting (1) into (2), the parity term becomes ceil(K * (N-1) / (N-1)) = K,
so (2) requires that

      K * (N-1) + K = K * N is odd

If N is even, K * N cannot be odd, and therefore the situation cannot
arise. If N is odd, it is possible to satisfy both (1) and (2).
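
This is easy to verify by brute force; a short sketch (mine, not part of
any ZFS tooling) that checks the two conditions over a range of write sizes:

    import math

    def raidz1_can_hit_bug(nchildren, max_sectors=64):
        # Check whether some write size satisfies both conditions for a
        # raidz1 vdev with 'nchildren' disks: data sectors a multiple of
        # N-1, and data + parity sectors odd (forcing a skipped sector).
        n = nchildren
        for data in range(1, max_sectors + 1):       # data sectors = size / 512
            parity = math.ceil(data / (n - 1))
            if data % (n - 1) == 0 and (data + parity) % 2 == 1:
                return True
        return False

    for n in range(3, 9):
        print(n, "children:", "affected" if raidz1_can_hit_bug(n) else "unaffected")
    # Only odd child counts (3, 5, 7) come out affected.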

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
