Check your switch/link/uplink utilization. HDFS-941 might help; it is not in Hadoop 1.0, according to a cursory search over the branch history in the Git mirror.
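For the utilization check, a quick way to eyeball per-interface throughput on each node is to sample /proc/net/dev. A minimal sketch (the 5-second interval and the idea of running it on every datanode while writes are failing are my assumptions, not anything from your setup):

```python
# Spot a saturated uplink by sampling /proc/net/dev twice and
# printing per-interface throughput. Linux-only; run on each node.
import time

def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:      # first two lines are headers
        if ":" not in line:
            continue
        iface, data = line.split(":", 1)
        fields = data.split()
        # column 0 = bytes received, column 8 = bytes transmitted
        counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def report_throughput(interval=5):
    """Sample twice, interval seconds apart, and print KB/s per interface."""
    with open("/proc/net/dev") as f:
        before = parse_net_dev(f.read())
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        after = parse_net_dev(f.read())
    for iface in sorted(after):
        if iface not in before:
            continue
        rx = (after[iface][0] - before[iface][0]) / float(interval) / 1024
        tx = (after[iface][1] - before[iface][1]) / float(interval) / 1024
        print("%-10s rx %10.1f KB/s   tx %10.1f KB/s" % (iface, rx, tx))

# report_throughput()  # uncomment to sample live while the writes are failing
```

If an interface sits pinned near line rate during the failures, that points at the network design rather than HDFS.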
As another datapoint, we see this in our production with a Hadoop that is much closer to CDH3; but we have some known issues with the network design in our legacy datacenters and plan to resolve them with an eventual relocation. I'm also integrating HDFS-941.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

----- Original Message -----
> From: Mikael Sitruk <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, February 23, 2012 1:25 PM
> Subject: Exception in hbase 0.92. with DFS, - Bad connect ack
>
> Hi
>
> I see that I have in my hbase logs a lot of the following (the target IP is
> changing):
>
> 2012-02-23 23:04:02,699 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>   createBlockOutputStream 10.232.83.87:50010 java.io.IOException: Bad connect
>   ack with firstBadLink as 10.232.83.118:50010
> 2012-02-23 23:04:02,699 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>   block blk_4678388308309640326_170570
> 2012-02-23 23:04:02,701 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>   datanode 10.232.83.118:50010
>
> Then checking the hdfs log of the same server (87):
>
> 2012-02-23 23:04:02,698 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   writeBlock blk_4678388308309640326_170570 received exception
>   java.net.SocketTimeoutException: 66000 millis timeout while waiting for
>   channel to be ready for connect. ch :
>   java.nio.channels.SocketChannel[connection-pending remote=/10.232.83.118:50010]
> 2012-02-23 23:04:02,699 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>   DatanodeRegistration(10.232.83.87:50010,
>   storageID=DS-1257662823-10.232.83.87-50010-1329398253085, infoPort=50075,
>   ipcPort=50020):DataXceiver
>   java.net.SocketTimeoutException: 66000 millis timeout while waiting for
>   channel to be ready for connect. ch :
>   java.nio.channels.SocketChannel[connection-pending remote=/10.232.83.118:50010]
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:319)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>     at java.lang.Thread.run(Thread.java:662)
>
> Looking at the target (118) server's hdfs log does not seem to show any
> problem around the same time:
>
> 2012-02-23 23:04:01,648 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
>   src: /10.232.83.118:45623, dest: /10.232.83.118:50010, bytes: 67108864, op: HDFS_WRITE,
>   cliID: DFSClient_hb_rs_shaked118,60020,1329985953141, offset: 0,
>   srvID: DS-1348867834-10.232.83.118-50010-1329398246569,
>   blockid: blk_-1747243057136009792_170577, duration: 6932047000
> 2012-02-23 23:04:01,649 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   PacketResponder 2 for block blk_-1747243057136009792_170577 terminating
> 2012-02-23 23:04:01,656 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   Receiving block blk_-4467275870825484381_170577 src: /10.232.83.118:45626
>   dest: /10.232.83.118:50010
> 2012-02-23 23:04:03,467 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   Receiving block blk_6330134749736235430_170577 src: /10.232.83.114:49175
>   dest: /10.232.83.118:50010
> 2012-02-23 23:04:05,153 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
>   src: /10.232.83.118:50010, dest: /10.232.83.118:45615, bytes: 67633152, op: HDFS_READ,
>   cliID: DFSClient_hb_rs_shaked118,60020,1329985953141, offset: 0,
>   srvID: DS-1348867834-10.232.83.118-50010-1329398246569,
>   blockid: blk_-7285361301892533992_165555, duration: 27134342000
> 2012-02-23 23:04:08,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
>   src: /10.232.83.118:45626, dest: /10.232.83.118:50010, bytes: 67108864, op: HDFS_WRITE,
>   cliID: DFSClient_hb_rs_shaked118,60020,1329985953141, offset: 0,
>   srvID: DS-1348867834-10.232.83.118-50010-1329398246569,
>   blockid: blk_-4467275870825484381_170577, duration: 6906584000
> 2012-02-23 23:04:08,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   PacketResponder 2 for block blk_-4467275870825484381_170577 terminating
> 2012-02-23 23:04:08,572 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   Receiving block blk_6927577191995683160_170577 src: /10.232.83.118:45629
>   dest: /10.232.83.118:50010
> 2012-02-23 23:04:09,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>   Receiving block blk_7440488846881064366_170577 src: /10.232.83.86:60436
>   dest: /10.232.83.118:50010
>
> I have checked GC logs, but no pauses were noted (all full GC pauses <10ms).
>
> Any idea what the problem can be?
>
> I use HBase 0.92.0 and HDFS 1.0.0.
> Thanks,
> Mikael.S
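One more note on the 66000 ms figure in the trace above: if I read the code right, the DataNode's pipeline connect timeout is derived from dfs.socket.timeout (default 60000 ms) plus a small per-downstream-node extension, so it is tunable. Raising it only hides a slow network rather than fixing it, but as a stopgap you could try something along these lines in hdfs-site.xml (the values are illustrative, not a recommendation):

```xml
<!-- hdfs-site.xml sketch; values are illustrative only -->
<property>
  <name>dfs.socket.timeout</name>
  <value>120000</value>   <!-- read/connect timeout in ms; default 60000 -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>   <!-- DataNode write timeout in ms; default 480000 -->
</property>
```

Both settings need a DataNode (and client) restart to take effect.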
