This is a really curious case.

How many replicas of each block do you have?
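
If you are not sure, fsck can report the replication and health of every block under the table's directory. For example, using the warehouse path from your stack trace (adjust it to wherever your table actually lives):

     hadoop fsck /user/hive/warehouse/att_log -files -blocks -locations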

Are you able to copy the data directly using the HDFS client?
You could try the hadoop fs -copyToLocal command and see if it can copy the
data from HDFS correctly.

That would help you verify that the issue really is at the HDFS layer (though it
does look like that from the stack trace).
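
For example, something like this, using the file from your stack trace (the local destination path is only a placeholder):

     hadoop fs -copyToLocal /user/hive/warehouse/att_log/collect_time=1313592519963/load.dat /tmp/load.dat

If that also fails with a ChecksumException, the corruption is in HDFS itself rather than anything in Hive or MapReduce.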

Which file format are you using?

Thanks
Vaibhav

-----Original Message-----
From: W S Chung [mailto:[email protected]] 
Sent: Friday, August 19, 2011 3:26 PM
To: [email protected]
Subject: org.apache.hadoop.fs.ChecksumException: Checksum error:

For some reason, the question I sent two days ago again never showed up, even
though I can find it via Google. I apologize if you have seen this question
before.

After loading around 2G or so of data in a few files into Hive, the "select
count(*) from table" query keeps failing. The JobTracker UI gives the following
error:

     org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_8155249261522439492:of:/user/hive/warehouse/att_log/collect_time=1313592519963/load.dat at 51794944
       at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
       at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
       at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
       at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
       at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
       at java.io.DataInputStream.read(DataInputStream.java:83)
       at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:66)
       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:32)
       at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
       at org.apache.hadoop.mapred.Child.main(Child.java:159)

fsck reports that there are corrupted blocks. I have tried dropping the table and
reloading it a few times. As far as I can see, the behavior is somewhat different
every time, in terms of how many blocks get corrupted and how many files I have
loaded before the corrupted blocks appear.
Sometimes the corrupted blocks show up right after the data is loaded, and
sometimes only after the "select count(*)" query is run. I have tried setting
"io.skip.checksum.errors" to true, but it has no effect at all.

I know that a checksum error is usually an indication of a hardware problem. But
we are running Hive on an NFS cluster and have ECC memory.
Our system admin here is not willing to believe that our high-quality hardware
has so many issues. I did try installing a simpler single-node Hive on another
machine, and the problem does not appear in that install after the data is
loaded. Can someone give me some pointers on what else to try? Thanks.
