Hi Rural, That's interesting. Since you are passing hbase.zookeeper.property.maxClientCnxns, does that mean ZooKeeper is managed by HBase? If you experience the issue again, can you try to obtain a jstack (as the user that started the HBase process, or from the RS UI at rs:port/dump if it is responsive), as Ted suggested? The output of "top -H -p <PID>" might be useful too, where <PID> is the pid of the RS. If you have some metrics monitoring, it would also be interesting to see how callQueueLength and the number of blocked threads change over time.
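If it helps, something along these lines could poll the RS over remote JMX for the number of BLOCKED threads (just a rough, untested sketch: it assumes remote JMX is enabled on the RS, and the host slave2 and port 10102 are only placeholders taken from the commented-out example in hbase-env.sh):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RsBlockedThreadProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port: remote JMX has to be enabled on the RS
        // (e.g. via HBASE_JMX_BASE / HBASE_REGIONSERVER_OPTS in hbase-env.sh).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://slave2:10102/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            int blocked = 0;
            // Count threads currently BLOCKED (waiting to enter a monitor).
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                    blocked++;
                }
            }
            System.out.println("BLOCKED threads: " + blocked
                    + " of " + threads.getThreadCount());
        } finally {
            connector.close();
        }
    }
}

Running it every few seconds while the port is unreachable would show whether handler threads are piling up in the BLOCKED state; a jstack of the RS pid captures the same thread states in full detail.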
cheers,
esteban.

--
Cloudera, Inc.

On Tue, Jul 8, 2014 at 6:58 PM, Rural Hunter <[email protected]> wrote:

> No. I used the standard log4j file and there is no network problem from
> the client. I checked the web admin UI and the master still considers the
> slave as working; it's just that the request count is very small (about 10
> while others are in the several hundreds). I sshed into the slave server
> and I can see that port 60020 is open with the netstat command, but I am
> not able to telnet to the port even on the server itself. It just times
> out. The situation is the same for clients on other servers. After it
> recovered automatically, I could telnet to the 60020 port from both the
> slave server and the other servers.
>
> This is my server configuration: http://pastebin.com/Ks4cCiaE
>
> Client configuration:
> myConf.set("hbase.zookeeper.quorum", hbaseQuorum);
> myConf.set("hbase.client.retries.number", "3");
> myConf.set("hbase.client.pause", "1000");
> myConf.set("hbase.client.max.perserver.tasks", "10");
> myConf.set("hbase.client.max.perregion.tasks", "10");
> myConf.set("hbase.client.ipc.pool.size", "5");
> myConf.set("zookeeper.recovery.retry", "1");
>
> The error on the client:
> Exception in thread "main"
> org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Failed after attempts=3, exceptions:
> Mon Jul 07 19:10:35 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518,
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=slave2/192.168.2.88:60020]
> Mon Jul 07 19:10:58 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518,
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=slave2/192.168.2.88:60020]
> Mon Jul 07 19:11:23 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518,
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
> waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=slave2/192.168.2.88:60020]
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
> at org.apache.hadoop.hbase.client.HTable.delete(HTable.java:831)
>
> On 2014/7/9 1:02, Esteban Gutierrez wrote:
>
>> Hello Rural,
>>
>> It doesn't seem to be a problem with the region server from what I can
>> tell. The RS is not showing any message in the logs about a long pause
>> (unless you have a non-standard log4j.properties file), and if the RS had
>> been in a very long pause due to GC or any other issue, the master should
>> have considered this region server dead; from the logs it doesn't look
>> like that happened. Have you double-checked from the client side for any
>> connectivity issue to the RS? Can you pastebin the client and the HBase
>> cluster confs?
>>
>> cheers,
>> esteban.
>>
>> --
>> Cloudera, Inc.
>>
>> On Tue, Jul 8, 2014 at 2:14 AM, Rural Hunter <[email protected]>
>> wrote:
>>
>>> OK, I will try to do that when it happens again. Thanks.
>>>
>>> On 2014/7/8 17:06, Ted Yu wrote:
>>>
>>>> Next time this happens, can you take a jstack of the region server and
>>>> pastebin it?
>>>>
>>>> Thanks
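For reference, the telnet test described above can also be reproduced with a bare TCP connect using the same 20000 ms timeout that appears in the stack trace. This is only a rough sketch (host and port taken from the error above); it checks TCP connectivity only, not the HBase RPC layer:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) {
        // Host and port taken from the ConnectTimeoutException above.
        String host = "192.168.2.88";
        int port = 60020;
        int timeoutMillis = 20000; // same timeout the client reported
        try (Socket socket = new Socket()) {
            long start = System.currentTimeMillis();
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            System.out.println("connected in "
                    + (System.currentTimeMillis() - start) + " ms");
        } catch (IOException e) {
            System.out.println("connect failed: " + e);
        }
    }
}

If this also times out from the slave itself while netstat shows the port as listening, that points at the accept queue or the RPC listener being stuck rather than at the network between client and server.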
