Hi,

I am using Hadoop 0.20.1. Recently we had a JobTracker outage because of the 
following:

The JobTracker tries to write a file to HDFS, but its connection to the primary 
datanode gets disrupted. It then enters a retry loop that goes on for hours.
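To make clear what I mean by the retry loop: the client keeps retrying indefinitely with no cap on attempts or total elapsed time. As a contrast, here is a minimal, purely illustrative sketch (not Hadoop code; all names are made up) of what a bounded retry with a deadline would look like:

```java
// Illustrative only -- NOT from the Hadoop codebase.
// Shows a retry loop bounded by both an attempt count and a wall-clock
// deadline, unlike the unbounded "retrying..." behavior described above.
public class BoundedRetry {

    interface Attempt {
        boolean tryOnce() throws Exception;
    }

    // Retries until success, until maxAttempts is exhausted, or until the
    // deadline elapses -- whichever comes first.
    static boolean retry(Attempt a, int maxAttempts, long deadlineMillis,
                         long backoffMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + deadlineMillis;
        for (int i = 0; i < maxAttempts
                && System.currentTimeMillis() < deadline; i++) {
            try {
                if (a.tryOnce()) {
                    return true;
                }
            } catch (Exception e) {
                // log and fall through to the backoff sleep
            }
            Thread.sleep(backoffMillis);
        }
        return false; // give up instead of spinning for hours
    }

    public static void main(String[] args) throws InterruptedException {
        // An attempt that always fails: with a cap, we give up after 3 tries.
        boolean ok = retry(() -> false, 3, 1000, 10);
        System.out.println(ok ? "succeeded" : "gave up");
    }
}
```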

I see the following message in the JobTracker log:



2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-3114565976339273197_13989812 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.216.48.12:55432 remote=/10.216.241.26:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)

2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 bad datanode[0] 10.216.241.26:50010

2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 in pipeline 10.216.241.26:50010, 10.193.31.55:50010, 10.193.31.54:50010: bad datanode 10.216.241.26:50010

2011-05-05 10:15:32,458 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /progress/_logs/history/hadoop.jobtracker.com_1299948850437_job_201103121654_161356_user_myJob retrying...

The last message that I see in the namenode log regarding this block is:



2011-05-05 10:15:27,208 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-3114565976339273197_13989812, newgenerationstamp=13989830, newlength=260096, newtargets=[10.193.31.54:50010], closeFile=false, deleteBlock=false)



This problem looks similar to what was reported here: 
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/f02a5a08e50de544

Any ideas?

Thanks,
-Rakesh
