-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Adam Leventhal
Sent: Thursday, September 03, 2009 2:08 AM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123

Hey folks,

There are two problems with RAID-Z in builds snv_120 through snv_123; both will be resolved in build snv_124. The problems are as follows:

1. Data corruption on a RAID-Z system of any sort (raidz1, raidz2, raidz3) can lead to spurious checksum errors being reported on devices that were not used as part of the reconstruction. These errors are harmless and can be cleared safely (zpool clear <pool>).

2. There is a far more serious problem with single-parity RAID-Z that can lead to data corruption. This corruption is recoverable as long as no additional data corruption or drive failure occurs; that is to say, data is fine provided there is no additional problem. The problem is present on all raidz1 configurations that use an odd number of children (disks), e.g. 4+1 or 6+1. Note that raidz1 configurations with an even number of children (e.g. 3+1), raidz2, and raidz3 are unaffected.

The recommended course of action is to roll back to build snv_119 or earlier. If for some reason this is impossible, please email me PRIVATELY, and we can discuss the best course of action for you.

After rolling back, initiate a scrub. ZFS will identify and correct these errors, but if enough accumulate it will (incorrectly) identify drives as faulty (which they likely aren't). You can clear these failures (zpool clear <pool>).

Without rolling back, repeated scrubs will eventually remove all traces of the data corruption. You may need to clear checksum failures as they're identified to ensure that enough drives remain online.
For reference, here's the bug:

6869090 filebench on thumper with ZFS (snv_120) raidz causes checksum errors from all drives

Apologies for the bug and for any inconvenience this caused. Below is a technical description of the two issues. This is for interest only and does not contain additional discussion of symptoms or prescriptive action.

Adam

---8<---

1. In situations where a block read from a RAID-Z vdev fails to checksum but there were no errors from any of the child vdevs (e.g. hard drives), we must enter combinatorial reconstruction, in which we attempt every combination of data and parity until we find the correct data. The logic was modified to scale to triple-parity RAID-Z, and in doing so I introduced a bug in which spurious error reports may in some circumstances be generated for vdevs that were not used as part of the data reconstruction. These do not represent actual corruption or problems with the underlying devices and can be ignored and cleared.

2. This one is far subtler and requires an understanding of how RAID-Z writes work. For that I strongly recommend the following blog post from Jeff Bonwick:

http://blogs.sun.com/bonwick/entry/raid_z

Basically, RAID-Z writes full stripes every time; note that without careful accounting it would be possible to effectively fragment the vdev such that single sectors were free but useless, since single-parity RAID-Z requires two adjacent sectors to store data (one for data, one for parity). To address this, RAID-Z rounds up its allocation to the next multiple of (nparity + 1). This ensures that all space is accounted for. RAID-Z will thus skip sectors that are unused based on this rounding. For example, under raidz1 a write of 1024 bytes would result in 512 bytes of parity, 512 bytes of data on each of two devices, and 512 bytes skipped.

To improve performance, ZFS aggregates multiple adjacent IOs into a single large IO. Further, hard drives themselves can perform aggregation of adjacent IOs.
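The allocation rounding described here can be sketched in a few lines of Python. This is a toy model only, assuming 512-byte sectors; raidz_sectors is a hypothetical helper that mirrors the accounting described above, not actual ZFS code:

```python
SECTOR = 512

def raidz_sectors(size, ndisks, nparity=1):
    """Toy model of RAID-Z allocation accounting as described above."""
    data = size // SECTOR                      # data sectors
    # one parity sector per stripe row of (ndisks - nparity) data sectors
    parity = -(-data // (ndisks - nparity))    # ceiling division
    total = data + parity
    # round the allocation up to the next multiple of (nparity + 1)
    rounded = -(-total // (nparity + 1)) * (nparity + 1)
    return data, parity, rounded - total       # last item: skipped sectors

# The 1024-byte raidz1 example on a 3-disk vdev:
print(raidz_sectors(1024, 3))   # (2, 1, 1)
```

For the example in the text this yields two data sectors, one parity sector, and one skipped sector.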
We noted that these skipped sectors were inhibiting performance, so we added "optional" IOs that could be used to improve aggregation. This yielded a significant performance boost for all RAID-Z configurations.

Another nuance of single-parity RAID-Z is that while it normally lays down stripes as P D D (parity, data, data, ...), it will switch every megabyte to move the parity into the second position (data, parity, data, ...). This was ostensibly to effect the same improvement as between RAID-4 and RAID-5 -- distributed parity. However, implementing RAID-5 actually requires full distribution of parity, AND RAID-Z already distributes parity by virtue of the skipped sectors and variable-width stripes. In other words, this was not a particularly valid optimization. It was accordingly discarded for double- and triple-parity RAID-Z; they contain no such swapping.

The implementation of this swapping was not taken into account for the optional IOs, so rather than writing the optional IO into the skipped sector, the optional IO overwrote the first sector of the subsequent stripe with zeros. The aggregation does not always happen, so the corruption is usually not pervasive. Further, raidz1 vdevs with odd numbers of children are more likely to encounter the problem.

Let's say we have a raidz1 vdev with three children.
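The positional swap can be modeled roughly as follows. This is a sketch only: it assumes, based on the "every megabyte" description above, that the swap alternates with each 1 MB region of device offset, and raidz1_layout is a hypothetical illustration, not ZFS source:

```python
def raidz1_layout(offset_bytes, ncols):
    """Illustrative model: which column holds parity for a raidz1
    stripe starting at the given device offset. Assumes (hypothetically)
    that the swap alternates with each 1 MB region."""
    cols = ['P'] + ['D'] * (ncols - 1)
    if (offset_bytes >> 20) & 1:       # odd-numbered megabyte region: swap
        cols[0], cols[1] = cols[1], cols[0]
    return cols

print(raidz1_layout(0, 3))        # ['P', 'D', 'D']
print(raidz1_layout(1 << 20, 3))  # ['D', 'P', 'D']
```

In this model the skipped sector and the zero-filled optional IO occupy fixed positions while the parity moves, which is the mismatch the text describes.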
Two writes of 1K each would look like this:

     disks
    0   1   2
  _____________
 |   |   |   |   P = parity            LBAs
 | P | D | D |   D = data               |
 |___|___|___|   X = skipped sector     |
 |   |   |   |                          v
 | X | P | D |
 |___|___|___|
 |   |   |   |
 | D | X |   |
 |___|___|___|

The logic for the optional IOs effectively (though not literally) in this case would fill in the next LBA on the disk with a 0:

  _____________
 |   |   |   |   P = parity            LBAs
 | P | D | D |   D = data               |
 |___|___|___|   X = skipped sector     |
 |   |   |   |   0 = zero-data from     v
 | 0 | P | D |       aggregation
 |___|___|___|
 |   |   |   |
 | D | X |   |
 |___|___|___|

We can see the problem when the parity undergoes the swap described above:

     disks
    0   1   2
  _____________
 |   |   |   |   P = parity            LBAs
 | D | P | D |   D = data               |
 |___|___|___|   X = skipped sector     |
 |   |   |   |   0 = zero-data from     v
 | X | 0 | P |       aggregation
 |___|___|___|
 |   |   |   |
 | D | X |   |
 |___|___|___|

Note that the 0 is incorrectly swapped as well, inadvertently overwriting a data sector in the subsequent stripe.

This only occurs if there is IO aggregation, making it much more likely with small, synchronous IOs. It's also only possible with an odd number (N) of child vdevs, since to induce the problem the size of the data written must consume a multiple of N-1 sectors _and_ the total number of sectors used for data and parity must be odd (to create the need for a skipped sector). The number of data sectors is simply size / 512, and the number of parity sectors is ceil(size / 512 / (N-1)).

  1) size / 512 = K * (N-1)
  2) size / 512 + ceil(size / 512 / (N-1)) is odd

therefore

  K * (N-1) + K = K * N is odd

If N is even, K * N cannot be odd and therefore the situation cannot arise. If N is odd, it is possible to satisfy (1) and (2).
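The parity argument above can be checked numerically with a small Python sketch; triggers_bug is a hypothetical name encoding conditions (1) and (2) from the derivation:

```python
import math

def triggers_bug(size, n):
    """Do conditions (1) and (2) above hold for a write of `size` bytes
    to a raidz1 vdev with n children? (Sketch of the argument only.)"""
    data = size // 512
    if size % 512 or data % (n - 1) != 0:   # (1): size/512 = K * (N-1)
        return False
    parity = math.ceil(data / (n - 1))
    return (data + parity) % 2 == 1         # (2): data + parity sectors odd

# Even child counts can never satisfy both conditions...
assert not any(triggers_bug(s, n) for n in (2, 4, 6, 8)
               for s in range(512, 1 << 20, 512))
# ...while odd child counts do, e.g. the 1K-write-to-3-disks example above.
assert triggers_bug(1024, 3)
```

This matches the conclusion in the text: with an even number of children the total sector count is always even, so the skipped-sector condition never arises.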
--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss