This is a really curious case. How many replicas of each block do you have?
Are you able to copy the data directly using the HDFS client? You could try the hadoop fs -copyToLocal command and see whether it copies the data out of HDFS correctly. That would help you verify that the issue really is at the HDFS layer (though it does look that way from the stack trace). Which file format are you using?

Thanks
Vaibhav

-----Original Message-----
From: W S Chung [mailto:[email protected]]
Sent: Friday, August 19, 2011 3:26 PM
To: [email protected]
Subject: org.apache.hadoop.fs.ChecksumException: Checksum error:

For some reason, my question sent two days ago again never showed up, even though I can google it. I apologize if you have seen this question before.

After loading around 2G or so of data in a few files into Hive, the "select count(*) from table" query keeps failing. The JobTracker UI gives the following error:

org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_8155249261522439492:of:/user/hive/warehouse/att_log/collect_time=1313592519963/load.dat at 51794944
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:66)
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:32)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:159)

fsck reports that there are corrupted blocks. I have tried dropping the table and reloading a few times. As far as I can see, the behavior is somewhat different every time, in terms of how many blocks are corrupted and how many files I have loaded before the corrupted blocks appear. Sometimes the corrupted blocks show up right after the data is loaded, and sometimes only after the "select count(*)" query is run.

I have tried setting "io.skip.checksum.errors" to true, but it has no effect at all.

I know that a checksum error is usually an indication of a hardware problem. But we are running Hive on an NFS cluster with ECC memory, and our system admin here is not willing to believe that our high-quality hardware has so many issues. I did try installing a simpler single-node Hive on another machine, and the problem does not appear in that install after the data is loaded.

Can someone give me some pointers on what else to try? Thanks.
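For anyone following along, the per-chunk verification that FSInputChecker performs in the stack trace above can be mimicked locally to make the failure mode concrete. Here is a minimal shell sketch under stated assumptions: it uses a local stand-in file and cksum in place of the CRC32 metadata HDFS stores alongside each block (512 bytes per checksum matches the io.bytes.per.checksum default; all file names here are illustrative, not taken from the cluster):

```shell
#!/bin/sh
# Sketch of HDFS-style per-chunk checksumming: HDFS records a CRC for every
# io.bytes.per.checksum (default 512) bytes of a block when it is written,
# and re-verifies each chunk on read. A mismatch is what surfaces as
# org.apache.hadoop.fs.ChecksumException.
set -e
CHUNK=512
WORKDIR=$(mktemp -d)

# Stand-in for load.dat: about three chunks of data.
head -c 1500 /dev/zero | tr '\0' 'a' > "$WORKDIR/load.dat"

# "Write path": split into chunks and record one CRC per chunk.
split -b "$CHUNK" "$WORKDIR/load.dat" "$WORKDIR/chunk."
for c in "$WORKDIR"/chunk.*; do cksum "$c"; done | awk '{print $1}' > "$WORKDIR/crcs.stored"

# "Read path": recompute and compare; an intact file verifies cleanly.
for c in "$WORKDIR"/chunk.*; do cksum "$c"; done | awk '{print $1}' > "$WORKDIR/crcs.read"
cmp -s "$WORKDIR/crcs.stored" "$WORKDIR/crcs.read" && CLEAN=yes || CLEAN=no

# Flip one byte in the first chunk, as silent disk or memory corruption
# would, and re-verify: this is the mismatch FSInputChecker.verifySum reports.
printf 'X' | dd of="$WORKDIR/chunk.aa" bs=1 conv=notrunc 2>/dev/null
for c in "$WORKDIR"/chunk.*; do cksum "$c"; done | awk '{print $1}' > "$WORKDIR/crcs.reread"
cmp -s "$WORKDIR/crcs.stored" "$WORKDIR/crcs.reread" && CORRUPT=no || CORRUPT=yes

echo "clean read ok: $CLEAN; corruption detected: $CORRUPT"
rm -rf "$WORKDIR"
```

On the cluster itself, the analogous checks are hadoop fsck on the warehouse path with -files -blocks -locations to see which blocks (and which replicas) fsck flags as corrupt, and hadoop fs -copyToLocal on one of the affected files, which runs the same client-side verification outside of Hive and MapReduce.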
