Thanks Andrew for the quick response. I also suspect the network; I have SSH sessions that block from time to time. How do you check the switch/link/uplink utilization?
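On the hosts themselves, I assume per-interface utilization can be sampled from the kernel byte counters. A minimal sketch (the interface auto-pick and the 5-second interval are placeholders, not anything specific to our setup):

```shell
#!/bin/sh
# Sketch: sample a NIC's byte counters twice and report average throughput.
# IFACE defaults to the first interface listed under /sys/class/net;
# INTERVAL is an arbitrary sampling window in seconds.
IFACE=${IFACE:-$(ls /sys/class/net | head -n 1)}
INTERVAL=${INTERVAL:-5}
rx1=$(cat "/sys/class/net/$IFACE/statistics/rx_bytes")
tx1=$(cat "/sys/class/net/$IFACE/statistics/tx_bytes")
sleep "$INTERVAL"
rx2=$(cat "/sys/class/net/$IFACE/statistics/rx_bytes")
tx2=$(cat "/sys/class/net/$IFACE/statistics/tx_bytes")
# Average bytes per second over the sampling window, per direction.
echo "$IFACE rx: $(( (rx2 - rx1) / INTERVAL )) B/s tx: $(( (tx2 - tx1) / INTERVAL )) B/s"
```

The switch/uplink side would need the switch's own port counters (SNMP or the vendor CLI). On the hosts, `ethtool <iface>` shows the negotiated speed/duplex and `ethtool -S <iface>` dumps driver error counters, which seems worth checking when connects time out.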
Mikael.S

On Fri, Feb 24, 2012 at 12:06 AM, Andrew Purtell <[email protected]> wrote:

> Check your switch/link/uplink utilization.
>
> HDFS-941 might help. That is not in Hadoop 1.0 according to a cursory
> search over branch history in the Git mirror.
>
> As another datapoint, we see this in our production with a Hadoop that is
> much closer to CDH3; but we have some known issues with the network design
> in our legacy datacenters and plan to resolve it with an eventual
> relocation. I'm also integrating HDFS-941.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> > ----- Original Message -----
> > From: Mikael Sitruk <[email protected]>
> > To: [email protected]
> > Sent: Thursday, February 23, 2012 1:25 PM
> > Subject: Exception in hbase 0.92 with DFS - Bad connect ack
> >
> > Hi
> >
> > I see a lot of the following in my HBase logs (the target IP changes):
> >
> > 2012-02-23 23:04:02,699 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> > createBlockOutputStream 10.232.83.87:50010 java.io.IOException: Bad connect
> > ack with firstBadLink as 10.232.83.118:50010
> > 2012-02-23 23:04:02,699 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> > block blk_4678388308309640326_170570
> > 2012-02-23 23:04:02,701 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
> > datanode 10.232.83.118:50010
> >
> > Then, checking the HDFS log of the same server (87):
> >
> > 2012-02-23 23:04:02,698 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > writeBlock blk_4678388308309640326_170570 received exception
> > java.net.SocketTimeoutException: 66000 millis timeout while waiting for
> > channel to be ready for connect. ch :
> > java.nio.channels.SocketChannel[connection-pending remote=/10.232.83.118:50010]
> > 2012-02-23 23:04:02,699 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.232.83.87:50010,
> > storageID=DS-1257662823-10.232.83.87-50010-1329398253085, infoPort=50075,
> > ipcPort=50020):DataXceiver
> > java.net.SocketTimeoutException: 66000 millis timeout while waiting for
> > channel to be ready for connect. ch :
> > java.nio.channels.SocketChannel[connection-pending remote=/10.232.83.118:50010]
> >         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
> >         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:319)
> >         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
> >         at java.lang.Thread.run(Thread.java:662)
> >
> > Looking at the HDFS log of the target server (118) does not seem to show
> > any problem around the same time:
> >
> > 2012-02-23 23:04:01,648 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> > src: /10.232.83.118:45623, dest: /10.232.83.118:50010, bytes: 67108864,
> > op: HDFS_WRITE, cliID: DFSClient_hb_rs_shaked118,60020,1329985953141,
> > offset: 0, srvID: DS-1348867834-10.232.83.118-50010-1329398246569,
> > blockid: blk_-1747243057136009792_170577, duration: 6932047000
> > 2012-02-23 23:04:01,649 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder 2 for block blk_-1747243057136009792_170577 terminating
> > 2012-02-23 23:04:01,656 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Receiving block blk_-4467275870825484381_170577 src: /10.232.83.118:45626
> > dest: /10.232.83.118:50010
> > 2012-02-23 23:04:03,467 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Receiving block blk_6330134749736235430_170577 src: /10.232.83.114:49175
> > dest: /10.232.83.118:50010
> > 2012-02-23 23:04:05,153 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> > src: /10.232.83.118:50010, dest: /10.232.83.118:45615, bytes: 67633152,
> > op: HDFS_READ, cliID: DFSClient_hb_rs_shaked118,60020,1329985953141,
> > offset: 0, srvID: DS-1348867834-10.232.83.118-50010-1329398246569,
> > blockid: blk_-7285361301892533992_165555, duration: 27134342000
> > 2012-02-23 23:04:08,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> > src: /10.232.83.118:45626, dest: /10.232.83.118:50010, bytes: 67108864,
> > op: HDFS_WRITE, cliID: DFSClient_hb_rs_shaked118,60020,1329985953141,
> > offset: 0, srvID: DS-1348867834-10.232.83.118-50010-1329398246569,
> > blockid: blk_-4467275870825484381_170577, duration: 6906584000
> > 2012-02-23 23:04:08,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > PacketResponder 2 for block blk_-4467275870825484381_170577 terminating
> > 2012-02-23 23:04:08,572 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Receiving block blk_6927577191995683160_170577 src: /10.232.83.118:45629
> > dest: /10.232.83.118:50010
> > 2012-02-23 23:04:09,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > Receiving block blk_7440488846881064366_170577 src: /10.232.83.86:60436
> > dest: /10.232.83.118:50010
> >
> > I have checked the GC logs, but no pauses were noted (all full GC pauses
> > were <10ms).
> >
> > Any idea what the problem could be?
> >
> > I use HBase 0.92.0 and HDFS 1.0.0.
> >
> > Thanks
> > Mikael.S

--
Mikael.S
