I have a MapReduce job running on Hadoop 2.0.0, and on some 'heavy' jobs, I am
seeing the following errors in the reducer.
2014-11-04 13:30:57,761 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-60005389-172.30.21.49-1379424439243:blk_-4575496846575688807_62439186
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:695)
2014-11-04 13:30:57,842 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-60005389-172.30.21.49-1379424439243:blk_-4575496846575688807_62439186 in pipeline 172.30.120.143:50010, 172.30.120.186:50010: bad datanode 172.30.120.143:50010
2014-11-04 13:33:09,707 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-60005389-172.30.21.49-1379424439243:blk_-4575496846575688807_62439488
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:695)
Once this error occurs, every subsequent attempt to write to HDFS fails with a different error:
java.io.IOException: All datanodes 172.30.120.193:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:960)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
This error happens for EVERY write.
For reference, 172.30.21.49 is our NameNode, and 172.30.120.193 is the slave node on which this task was running.
What should I be looking at to stop this from happening? Could it be resource contention somewhere? I checked the NameNode console, and we have enough disk space.
Ideally I want to prevent this from happening, but I would also like recommendations on how to handle it when it does happen. Currently, the exception is caught and a counter is incremented. Maybe we should instead rethrow the exception so that the task attempt fails and is retried on another node.
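To make the two options concrete, here is a minimal sketch of what I mean. `writeRecord` and `badRecordCounter` are hypothetical stand-ins for our actual HDFS write and Hadoop counter; the point is only the control flow of swallow-and-count versus letting the exception propagate:

```java
import java.io.IOException;

public class WriteRetrySketch {
    // Hypothetical stand-in for the HDFS write; in the real reducer this
    // is a write to an FSDataOutputStream that starts failing.
    static void writeRecord(boolean hdfsHealthy) throws IOException {
        if (!hdfsHealthy) {
            throw new IOException("All datanodes ... are bad. Aborting...");
        }
    }

    static long badRecordCounter = 0;

    // Current approach: swallow the error and bump a counter. The task
    // attempt "succeeds" even though every subsequent write is lost.
    static void currentApproach(boolean hdfsHealthy) {
        try {
            writeRecord(hdfsHealthy);
        } catch (IOException e) {
            // In the real job: context.getCounter(...).increment(1)
            badRecordCounter++;
        }
    }

    // Proposed approach: do not catch; reduce() is declared to throw
    // IOException, so the framework fails this attempt and reschedules
    // it, usually on a different node.
    static void proposedApproach(boolean hdfsHealthy) throws IOException {
        writeRecord(hdfsHealthy);
    }
}
```

Since the `reduce()` signature already declares `throws IOException`, propagating it should need no wrapping, and the framework's normal retry limit (`mapreduce.reduce.maxattempts`) would bound the retries.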
Any recommendations/advice are welcome.
Thanks,
Hayden