Hi, I am using Hadoop 0.20.1. Recently we had a JobTracker outage because of the following:
JobTracker tries to write a file to HDFS, but its connection to the primary datanode gets disrupted. It then enters a retry loop that goes on for hours. I see the following messages in the JobTracker log:

2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-3114565976339273197_13989812
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.216.48.12:55432 remote=/10.216.241.26:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 bad datanode[0] 10.216.241.26:50010
2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 in pipeline 10.216.241.26:50010, 10.193.31.55:50010, 10.193.31.54:50010: bad datanode 10.216.241.26:50010
2011-05-05 10:15:32,458 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /progress/_logs/history/hadoop.jobtracker.com_1299948850437_job_201103121654_161356_user_myJob retrying...
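In case it helps anyone correlate the two logs, here is a rough sketch of how the lines for a single block can be pulled out of the JobTracker and NameNode logs. It assumes the "date time,millis LEVEL logger: message" layout seen in the excerpts in this post; the function name events_for_block and the regular expression are my own illustration and not part of Hadoop.

```python
import re

# Assumed line layout, inferred from the log excerpts in this post:
#   "<date> <time>,<millis> <LEVEL> <logger>: <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) (\S+): (.*)$"
)


def events_for_block(lines, block_id):
    """Return (timestamp, level, message) for each log line mentioning block_id.

    Lines that do not match the assumed layout (for example stack trace
    frames) are skipped.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and block_id in match.group(4):
            timestamp, level, _logger, message = match.groups()
            events.append((timestamp, level, message))
    return events
```

Running it over both log files with the same block id (here blk_-3114565976339273197) gives one timeline per file that can be compared by timestamp.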
The last message that I see in the namenode log regarding this block is:

2011-05-05 10:15:27,208 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-3114565976339273197_13989812, newgenerationstamp=13989830, newlength=260096, newtargets=[10.193.31.54:50010], closeFile=false, deleteBlock=false)

This problem looks similar to what these guys experienced here:
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/f02a5a08e50de544

Any ideas?

Thanks,
-Rakesh