> How do you check the switch/link/uplink ?

This is entirely dependent on how your network is put together, with what
components, and what type of monitoring is in place.
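If there is no switch-side monitoring at all, one rough host-side data point
is per-interface throughput on the DataNodes while the writes are failing.
Here is a minimal sketch (not from this thread; it assumes Linux hosts, Python,
and an interface named eth0, all of which you would need to adjust) that
samples the byte counters in /proc/net/dev:

    import time

    def read_bytes(iface):
        # /proc/net/dev lines look like: "eth0: <rx bytes> ... <tx bytes> ..."
        with open('/proc/net/dev') as f:
            for line in f:
                if line.strip().startswith(iface + ':'):
                    fields = line.split(':', 1)[1].split()
                    return int(fields[0]), int(fields[8])  # rx bytes, tx bytes
        raise ValueError('interface %s not found' % iface)

    iface = 'eth0'       # assumed interface name, adjust for your hosts
    interval = 5.0       # seconds between the two samples
    rx1, tx1 = read_bytes(iface)
    time.sleep(interval)
    rx2, tx2 = read_bytes(iface)
    print('%s: rx %.1f Mbit/s, tx %.1f Mbit/s' % (
        iface,
        (rx2 - rx1) * 8 / interval / 1e6,
        (tx2 - tx1) * 8 / interval / 1e6))

If one node's uplink is sitting near line rate while the connect timeouts
fire, that would point at the link or its switch port rather than at HDFS
itself.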
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)


>________________________________
> From: Mikael Sitruk <[email protected]>
>To: [email protected]; Andrew Purtell <[email protected]>
>Sent: Thursday, February 23, 2012 2:23 PM
>Subject: Re: Exception in hbase 0.92. with DFS, - Bad connect ack
>
>
>Thanks Andrew, for the quick response.
>I also suspect the network; I have SSH sessions that block from time to time.
>How do you check the switch/link/uplink?
>
>
>Mikael.S
>
>
>On Fri, Feb 24, 2012 at 12:06 AM, Andrew Purtell <[email protected]> wrote:
>
>>Check your switch/link/uplink utilization.
>>
>>HDFS-941 might help. That is not in Hadoop 1.0, according to a cursory search
>>over branch history in the Git mirror.
>>
>>
>>As another data point, we see this in our production with a Hadoop that is
>>much closer to CDH3, but we have some known issues with the network design
>>in our legacy datacenters and plan to resolve them with an eventual
>>relocation. I'm also integrating HDFS-941.
>>
>>
>>Best regards,
>>
>>
>>  - Andy
>>
>>Problems worthy of attack prove their worth by hitting back. - Piet Hein (via
>>Tom White)
>>
>>
>>----- Original Message -----
>>> From: Mikael Sitruk <[email protected]>
>>> To: [email protected]
>>> Cc:
>>> Sent: Thursday, February 23, 2012 1:25 PM
>>> Subject: Exception in hbase 0.92. with DFS, - Bad connect ack
>>>
>>> Hi
>>>
>>> I see a lot of the following in my HBase logs (the target IP is changing):
>>>
>>> 2012-02-23 23:04:02,699 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>>> createBlockOutputStream 10.232.83.87:50010 java.io.IOException: Bad connect
>>> ack with firstBadLink as 10.232.83.118:50010
>>> 2012-02-23 23:04:02,699 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>>> block blk_4678388308309640326_170570
>>> 2012-02-23 23:04:02,701 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>>> datanode 10.232.83.118:50010
>>>
>>> Then, checking the HDFS log of the same server (87):
>>>
>>> 2012-02-23 23:04:02,698 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>> blk_4678388308309640326_170570 received exception
>>> java.net.SocketTimeoutException: 66000 millis timeout while waiting for
>>> channel to be ready for connect. ch :
>>> java.nio.channels.SocketChannel[connection-pending remote=/10.232.83.118:50010]
>>> 2012-02-23 23:04:02,699 ERROR
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>>> 10.232.83.87:50010,
>>> storageID=DS-1257662823-10.232.83.87-50010-1329398253085, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> java.net.SocketTimeoutException: 66000 millis timeout while waiting for
>>> channel to be ready for connect. ch :
>>> java.nio.channels.SocketChannel[connection-pending remote=/10.232.83.118:50010]
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:319)
>>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
>>>     at java.lang.Thread.run(Thread.java:662)
>>>
>>> Looking at the target (118) server's HDFS log does not seem to show any
>>> problem around the same time:
>>>
>>> 2012-02-23 23:04:01,648 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.232.83.118:45623, dest: /10.232.83.118:50010, bytes: 67108864, op:
>>> HDFS_WRITE, cliID: DFSClient_hb_rs_shaked118,60020,1329985953141, offset:
>>> 0, srvID: DS-1348867834-10.232.83.118-50010-1329398246569, blockid:
>>> blk_-1747243057136009792_170577, duration: 6932047000
>>> 2012-02-23 23:04:01,649 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>> block blk_-1747243057136009792_170577 terminating
>>> 2012-02-23 23:04:01,656 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_-4467275870825484381_170577 src: /10.232.83.118:45626 dest:
>>> /10.232.83.118:50010
>>> 2012-02-23 23:04:03,467 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_6330134749736235430_170577 src: /10.232.83.114:49175 dest:
>>> /10.232.83.118:50010
>>> 2012-02-23 23:04:05,153 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.232.83.118:50010, dest: /10.232.83.118:45615, bytes: 67633152, op:
>>> HDFS_READ, cliID: DFSClient_hb_rs_shaked118,60020,1329985953141, offset: 0,
>>> srvID: DS-1348867834-10.232.83.118-50010-1329398246569, blockid:
>>> blk_-7285361301892533992_165555, duration: 27134342000
>>> 2012-02-23 23:04:08,569 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.232.83.118:45626, dest: /10.232.83.118:50010, bytes: 67108864, op:
>>> HDFS_WRITE, cliID: DFSClient_hb_rs_shaked118,60020,1329985953141, offset:
>>> 0, srvID: DS-1348867834-10.232.83.118-50010-1329398246569, blockid:
>>> blk_-4467275870825484381_170577, duration: 6906584000
>>> 2012-02-23 23:04:08,570 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>> block blk_-4467275870825484381_170577 terminating
>>> 2012-02-23 23:04:08,572 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_6927577191995683160_170577 src: /10.232.83.118:45629 dest:
>>> /10.232.83.118:50010
>>> 2012-02-23 23:04:09,283 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_7440488846881064366_170577 src: /10.232.83.86:60436 dest:
>>> /10.232.83.118:50010
>>>
>>> I have checked the GC logs, but no pauses were noted (all full GC pauses
>>> were <10 ms).
>>>
>>> Any idea what the problem could be?
>>>
>>> I am using HBase 0.92.0 and HDFS 1.0.0.
>>> Thanks
>>> Mikael.S
>>>
>>
>
>
>--
>
>Mikael.S
>
>
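Since the target IP in the "Bad connect ack" messages changes from failure to
failure, one way to see whether the problems still cluster on a single host or
link is to tally the firstBadLink addresses in the region server log. Below is
a minimal sketch, assuming Python, the DFSClient log format quoted above, and
a hypothetical log file name passed on the command line:

    import re
    import sys
    from collections import Counter

    # Matches the DFSClient line quoted above, e.g.
    #   ... Bad connect ack with firstBadLink as 10.232.83.118:50010
    pattern = re.compile(r'firstBadLink as (\d+\.\d+\.\d+\.\d+):\d+')

    # Log file path is an assumption; point it at your region server log.
    log_path = sys.argv[1] if len(sys.argv) > 1 else 'hbase-regionserver.log'

    counts = Counter()
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1

    for ip, n in counts.most_common():
        print('%-15s %d' % (ip, n))

If one datanode dominates the tally, its host, NIC, or switch port is the
first thing to examine; if the failures are spread evenly, a shared uplink or
a cluster-wide setting is more likely.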
