Recovering from corrupt blocks in HFile

Mike Dillon Tue, 17 Mar 2015 17:06:13 -0700

Hi all-

I've got an HFile that "hadoop fsck" reports as having a corrupt block, and
I was hoping to get some advice on recovering as much data as possible.


When I examined the blk_* files on the three datanodes that have a replica
of the affected block, I saw that the replicas on two of the datanodes had
the same SHA-1 checksum and that the replica on the other datanode was a
truncated version of the replica found on the other nodes (as reported by a
difference at EOF by "cmp"). The size of the two identical blocks is
67108864, the same as most of the other blocks in the file.
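For anyone who wants to repeat that comparison, here is a minimal sketch of
the same check in Python. It assumes the replica files have been copied to
local paths; the function names are only illustrative and are not part of
any Hadoop tooling.

```python
import hashlib


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at path, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other (a truncated copy), the
    returned offset equals the length of the shorter file, which is the
    "difference at EOF" case that cmp reports.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a, chunk_b = a.read(chunk_size), b.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file simply ended.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None
            offset += len(chunk_a)
```

If first_difference() returns exactly the size of the shorter replica, the
shorter file is a clean truncation of the longer one rather than a copy
with damaged bytes in the middle.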

Given that there were two datanodes with the same data and another with
truncated data, I made a backup of the truncated file and dropped the
full-length copy of the block in its place directly on the data mount,
hoping that this would cause HDFS to no longer report the file as corrupt.
Unfortunately, this didn't seem to have any effect.
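One thing that might be worth checking after a swap like this is whether
the block file still agrees with the checksum (.meta) file stored next to
it on the datanode. Below is a rough sketch of such a check. It rests on an
assumption about the .meta layout that I have not verified against this
Hadoop version: a 2-byte version, a 1-byte checksum type (1 = CRC32,
2 = CRC32C), a 4-byte bytes-per-checksum value, and then one 4-byte
big-endian CRC per chunk. The function names are hypothetical, and the
layout should be confirmed against the DataChecksum and BlockMetadataHeader
sources before trusting the result.

```python
import struct
import zlib


def _make_crc32c_table():
    """Build the lookup table for CRC-32C (Castagnoli, reflected form)."""
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data):
    """Pure-Python CRC-32C; slow on a 64 MB block, but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# ASSUMPTION: checksum type ids as described in the text above.
CHECKSUMS = {1: lambda d: zlib.crc32(d) & 0xFFFFFFFF, 2: crc32c}


def bad_chunks(block_path, meta_path):
    """Return the indices of chunks whose CRC does not match the meta file.

    ASSUMPTION: the meta file starts with a 7-byte header (version,
    checksum type, bytes per checksum) followed by 4-byte big-endian CRCs.
    """
    with open(meta_path, "rb") as m:
        _version, ctype, bytes_per_sum = struct.unpack(">hbi", m.read(7))
        stored = m.read()
    checksum = CHECKSUMS[ctype]
    bad = []
    with open(block_path, "rb") as b:
        index = 0
        while True:
            chunk = b.read(bytes_per_sum)
            if not chunk:
                break
            entry = stored[index * 4:index * 4 + 4]
            # A missing entry (meta shorter than the data) counts as bad.
            if len(entry) < 4 or struct.unpack(">I", entry)[0] != checksum(chunk):
                bad.append(index)
            index += 1
    return bad
```

An empty list from bad_chunks() would suggest the data and its checksums
are consistent on disk; a non-empty list gives chunk indices that can be
multiplied by the bytes-per-checksum value to get byte offsets within the
block.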

Looking through the Hadoop source code, it looks like there is a
CorruptReplicasMap internally that tracks which nodes have "corrupt" copies
of a block. In HDFS-6663 <https://issues.apache.org/jira/browse/HDFS-6663>,
a "-blockId" parameter was added to "hadoop fsck" to allow dumping the
reason that a block id is considered corrupt, but that wasn't added until
Hadoop 2.7.0 and our client is running 2.0.0-cdh4.6.0.

I also had a look at running the "HFile" tool on the affected file (cf.
section 9.7.5.2.2 at http://hbase.apache.org/0.94/book/regions.arch.html).
When I did that, I was able to see the data up to the corrupted block as
far as I could tell, but then it started repeatedly looping back to the
first row and starting over. I believe this is related to the behavior
described in https://issues.apache.org/jira/browse/HBASE-12949

My goal is to determine whether the block in question is actually corrupt
and, if so, in what way. If it's possible to recover all of the file except
a portion of the affected block, that would be OK too. I just don't want to
be in the position of having to lose all 3 gigs of data in this particular
region, given that most of it appears to be intact. I just can't find the
right low-level tools to let me diagnose the exact state and structure of
the block data I have for this file.

Any help or direction that someone could provide would be much appreciated.
For reference, I'll repeat that our client is running Hadoop 2.0.0-cdh4.6.0
and add that the HBase version is 0.94.15-cdh4.6.0.

Thanks!

-md
