more below...

On 2012-12-06 03:06, Jim Klimov wrote:
It also happens that on disks 1,2,3 the first row's sectors (d0, d2, d3)
are botched - ranges from 0x9C0 to 0xFFF (end of 4KB sector) are zeroes.

The neighboring blocks, located a few sectors away from this one, also
have compressed data and have some regular-looking patterns of bytes,
certainly no long stretches of zeroes.

However, the byte-by-byte XOR matching complains about the whole sector.
All bytes, except some 40 single-byte locations here and there, don't
XOR up to produce the expected (known from disk) value.
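
For illustration, a byte-by-byte XOR check of this kind could be sketched in Python roughly as follows. This is a simplified sketch, not the tool actually used: it assumes the data sectors and the parity sector of one row have already been read into equally sized byte strings, and the names `xor_parity` and `parity_mismatches` are made up for this example. Real raidz layouts need extra work to map a block onto sectors.

```python
from functools import reduce

SECTOR_SIZE = 4096  # 4KB sectors, as in the post


def xor_parity(data_sectors):
    """Return the byte-wise XOR of equally sized data sectors."""
    if not data_sectors:
        raise ValueError("need at least one data sector")
    size = len(data_sectors[0])
    if any(len(s) != size for s in data_sectors):
        raise ValueError("sectors differ in length")
    # zip(*...) walks the sectors column by column, one byte offset at a time.
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*data_sectors))


def parity_mismatches(data_sectors, parity_on_disk):
    """List the byte offsets where recomputed parity differs from the stored one."""
    computed = xor_parity(data_sectors)
    return [i for i, (c, p) in enumerate(zip(computed, parity_on_disk))
            if c != p]
```

A result covering nearly every offset of the sector, as described above, would mean the mismatch is not limited to the zeroed range.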

I did not yet try the second parity algorithm.

At least in this case, it does not seem that I would find an incantation
needed to recover this block - too many zeroes overlapping (at least 3
disks' data proven compromised), where I did hope for some shortcoming
in ZFS recombination exhaustiveness. In this case - it is indeed too
much failure to handle.

Now waiting for scrub to find me more test subjects - broken files ;)

So, these findings from my first tested bad file remain valid.
Now that scrub has found a couple more error locations (over the
past week it has progressed to just above 50% of the pool), there
are some more results.

So far only one location has random-looking different data in
the sectors of the block on different disks, which I might at
least try to salvage as described in the beginning of this thread.

In two of the three cases, some of the sectors (in the range which
mismatches the parity data) are clearly invalid: they are filled
with long stretches of zeroes, while the other sectors hold
uniform-looking binary data (results of compression).
Moreover, several of these sectors (4096 bytes long, at the same
offsets on different drives, and data components of the same
block) are literally identical, which apparently points to some
error upon write (perhaps some noise was interpreted by several
disks at once as a command to write at that location).
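
A quick way to spot such literally identical sectors is to group them by a digest of their contents. The following Python sketch assumes the sectors at the same offset have already been dumped from each drive into a dict keyed by a disk label; the labels and the function name `identical_sector_groups` are hypothetical.

```python
import hashlib
from collections import defaultdict


def identical_sector_groups(sectors):
    """Group disk labels whose sector contents are byte-identical.

    `sectors` maps a disk label (e.g. "d0") to that disk's sector bytes.
    Returns a list of label groups, only for contents seen more than once.
    """
    by_digest = defaultdict(list)
    for label, data in sectors.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(label)
    return [sorted(labels) for labels in by_digest.values() if len(labels) > 1]
```

For distinct compressed data on every drive, the result should be an empty list; any non-empty group marks sectors that cannot all be genuine data components of one block.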

The corrupted area looks like a series of "0xFC 0x42" bytes about
half a kilobyte long, followed by zero bytes to the end of sector.
Start of this area is not aligned to a multiple of 512 bytes.
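
The shape of such an area (a repeated two-byte pattern followed by zeroes to the end of the sector) can be located mechanically. Below is a minimal Python sketch under the assumption that the pattern run ends exactly where the trailing zeroes begin; `describe_corruption` and the keys of its result are names invented for this example.

```python
def describe_corruption(sector, pattern=b"\xfc\x42"):
    """Find a run of `pattern` that is followed by zeroes to the end of `sector`.

    Returns a dict describing the area, or None if no such run is found.
    """
    # Offset of the first byte of the trailing zero run.
    end = len(sector.rstrip(b"\x00"))
    step = len(pattern)
    start = end
    # Walk backwards over whole repetitions of the pattern.
    while start >= step and sector[start - step:start] == pattern:
        start -= step
    if start == end:
        return None
    return {
        "start": start,
        "pattern_bytes": end - start,
        "zero_tail_bytes": len(sector) - end,
        "aligned_512": start % 512 == 0,
    }
```

The `aligned_512` flag reports whether the area starts on a 512-byte boundary, which is the alignment question raised above.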

Since these disks are of an identical model and firmware, I am ready
to believe that they might misinterpret the same interference in the
same way. However, I was under the impression that the SATA protocol
uses CRCs on commands and data precisely to counter such noise?

Question: does this conclusion sound like a plausible explanation
for my data corruption (on disks which passed dozens of scrubs
successfully before developing these problems nearly at once in
about ten locations)?

Thanks for your attention,
//Jim Klimov

zfs-discuss mailing list