I tried using hadoop fs -copyToLocal. I also get a stack trace, like this:

11/08/22 10:53:57 INFO fs.FSInputChecker: Found checksum error:
b[1024, 
1536]=31325431393a32313a31315a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304431374635383833344135464336423341424646357c342e327c313931393638200a323031312d30352d31325431393a32313a31315a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304031374635383833344135464336423341424646357c342e322e317c313931393638200a323031312d30352d31325431393a32323a33395a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304431374635383833344135464336423341424646357c362e322e317c313837373837200a323031312d30352d31325431393a32323a34335a7c3137342e3235332e3234352e3232377c39376261623664642d353062342d343461612d383235642d6537336238646434336563337c36373842303935453945304431374635383833344135464336423341424646357c362e337c3138373738375f61745f706f736974696f6e5f3835200a323031312d30352d31325431393a32323a34335a7c3137342e
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2722854101062410251:of:/user/hive/warehouse/att_log/collect_time=1314024490064/load.dat at 64635904
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1158)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1718)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1770)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:53)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:72)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:320)
        at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:248)
        at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:199)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1754)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1880)
11/08/22 10:53:57 WARN hdfs.DFSClient: Found Checksum error for blk_2722854101062410251_1038 from 192.168.50.192:50010 at 64635904
11/08/22 10:53:57 INFO hdfs.DFSClient: Could not obtain block blk_2722854101062410251_1038 from any node: java.io.IOException: No live nodes contain current block
copyToLocal: Checksum error: /blk_2722854101062410251:of:/user/hive/warehouse/att_log/collect_time=1314024490064/load.dat at 64635904


I managed to load two files (by using the Java API copyFromLocal call and
then a 'load data inpath' call to load the data into the table). hadoop
fsck does not show a corrupted block until I run the 'select count(*)'
query after loading the second file. 'hadoop fs -copyToLocal' also only
fails after hadoop fsck shows the corrupted block. For the first loaded
file, 'hadoop fs -copyToLocal' works fine. It does look like the problem
is with HDFS.
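
For concreteness, here is roughly the shell equivalent of the workflow I
am using (I actually call the Java FileSystem.copyFromLocalFile API rather
than the shell command, and the local and staging paths below are made up
for illustration):

    hadoop fs -copyFromLocal /tmp/load.dat /user/hive/staging/load.dat
    hive -e "LOAD DATA INPATH '/user/hive/staging/load.dat' INTO TABLE att_log PARTITION (collect_time=1314024490064);"
    hive -e "SELECT COUNT(*) FROM att_log;"
    hadoop fsck /user/hive/warehouse/att_log -files -blocks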

I originally discovered this issue on a two-node cluster with a
replication factor of 2, but I am now testing on a pseudo-distributed
install with only one node and a replication factor of 1.
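
For reference, the replication factor on the pseudo-distributed node is
set with the standard dfs.replication property in hdfs-site.xml:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>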

I am using the TextFile format. I would like to try SequenceFile, since I
understand the "io.skip.checksum.errors" setting only applies to
SequenceFile. But the only way I know to load data into a table stored as
SequenceFile is to first load the text file into a table stored as
TextFile and then use an 'insert into ... select' to copy the data into
the SequenceFile table. That 'insert into ... select' already fails with
the same problem as running a query on the TextFile table, as sketched
below. Is there any other way to load a SequenceFile table?
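
In other words, the only route I know of looks roughly like this (the
table names, the single STRING column, and the staging path are
placeholders for illustration; the real schema is not shown here):

    CREATE TABLE att_log_text (line STRING) STORED AS TEXTFILE;
    LOAD DATA INPATH '/user/hive/staging/load.dat' INTO TABLE att_log_text;
    CREATE TABLE att_log_seq (line STRING) STORED AS SEQUENCEFILE;
    -- this is the step that fails with the same checksum error
    INSERT INTO TABLE att_log_seq SELECT * FROM att_log_text;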



On Fri, Aug 19, 2011 at 8:57 PM, Aggarwal, Vaibhav <[email protected]> wrote:
> This is a really curious case.
>
> How many replicas of each block do you have?
>
> Are you able to copy the data directly using HDFS client?
> You could try the hadoop fs -copyToLocal command and see if it can copy the 
> data from hdfs correctly.
>
> That would help you verify that the issue really is at the HDFS layer
> (though it does look like that from the stack trace).
>
> Which file format are you using?
>
> Thanks
> Vaibhav
>
> -----Original Message-----
> From: W S Chung [mailto:[email protected]]
> Sent: Friday, August 19, 2011 3:26 PM
> To: [email protected]
> Subject: org.apache.hadoop.fs.ChecksumException: Checksum error:
>
> For some reason, my question sent two days ago again never showed up, even
> though I can google it. I apologize if you have seen this question before.
>
> After loading around 2 GB or so of data in a few files into Hive, the
> "select count(*) from table" query keeps failing. The JobTracker UI gives
> the following error:
>
>     org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_8155249261522439492:of:/user/hive/warehouse/att_log/collect_time=1313592519963/load.dat at 51794944
>       at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
>       at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
>       at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
>       at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
>       at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1660)
>       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2257)
>       at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2307)
>       at java.io.DataInputStream.read(DataInputStream.java:83)
>       at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>       at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:66)
>       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:32)
>       at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:67)
>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>       at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>       at org.apache.hadoop.mapred.Child.main(Child.java:159)
>
> fsck reports that there are corrupted blocks. I have tried dropping the
> table and reloading a few times. As far as I can see, the behavior is
> somewhat different every time, in terms of how many blocks get corrupted
> and how many files I load before the corrupted blocks appear.
> Sometimes the corrupted blocks show up right after the data is loaded and
> sometimes only after the "select count(*)" query is run. I have tried
> setting "io.skip.checksum.errors" to true, but it has no effect at all.
>
> I know that a checksum error is usually an indication of a hardware
> problem. But we are running Hive on an NFS cluster with ECC memory.
> Our system admin here is not willing to believe that our high-quality
> hardware has so many issues. I did try installing a simpler single-node
> Hive on another machine, and the problem does not appear in that install
> after the data is loaded. Can someone give me some pointers on what else
> to try? Thanks.
>
