I have a MapReduce job running on Hadoop 2.0.0, and on some 'heavy' jobs, I am 
seeing the following errors in the reducer. 


2014-11-04 13:30:57,761 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-60005389-172.30.21.49-1379424439243:blk_-4575496846575688807_62439186
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:695)

2014-11-04 13:30:57,842 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-60005389-172.30.21.49-1379424439243:blk_-4575496846575688807_62439186 in pipeline 172.30.120.143:50010, 172.30.120.186:50010: bad datanode 172.30.120.143:50010

2014-11-04 13:33:09,707 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-60005389-172.30.21.49-1379424439243:blk_-4575496846575688807_62439488
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:695)

Once this error occurs, any subsequent attempt by the code to write to
HDFS fails with a different error:

java.io.IOException: All datanodes 172.30.120.193:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:960)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:780)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)

This second error occurs on every single write.

For reference, 172.30.21.49 is our NameNode, and 172.30.120.193 is the slave
node on which this task was running.

What should I be looking at to stop this from happening? Could it be
resource contention somewhere? I have looked at the NameNode console and we
have plenty of disk space.
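One possibility I have seen suggested on other threads (not yet verified on
our cluster) is that these pipeline failures under heavy write load come from
the datanodes running out of transfer threads or file descriptors, rather
than disk space. If that is the cause here, the usual advice seems to be to
raise dfs.datanode.max.transfer.threads (formerly dfs.datanode.max.xcievers)
in hdfs-site.xml on each datanode, along with the datanode user's open-file
ulimit. The value below is only illustrative, not something we have tested:

  <!-- hdfs-site.xml on each datanode; 8192 is an illustrative value -->
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>8192</value>
  </property>

Does that sound plausible, or is there a better place to look first?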

Obviously I want to prevent this from happening, but I would also like a
recommendation on what to do when it does occur. Currently the exception is
caught and a counter is incremented; perhaps we should instead be throwing
the exception up so that the task attempt fails and is retried somewhere
else (see the sketch below).
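To make that concrete, here is a minimal sketch of the change I have in
mind. MyReducer, the Text key/value types, and the counter names are
placeholders rather than our real code:

  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class MyReducer extends Reducer<Text, Text, Text, Text> {

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
              throws IOException, InterruptedException {
          for (Text value : values) {
              try {
                  context.write(key, value);
              } catch (IOException e) {
                  // Keep the counter for visibility into how often this happens.
                  context.getCounter("hdfs", "FAILED_WRITES").increment(1);
                  // Rethrow instead of swallowing, so the framework marks this
                  // task attempt as failed and reschedules it (up to
                  // mapreduce.reduce.maxattempts, possibly on another node).
                  throw e;
              }
          }
      }
  }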

Any recommendations/advice are welcome.

Thanks,
Hayden
