Re: techniques in trouble-shooting RPC Timeout problem

CW Chung Mon, 28 Sep 2015 18:22:43 -0700

Hi,

This is my first time posting a question in this forum. Please pardon
me (and let me know) if I am doing something improper :-) Thanks!


While running a load test of about 25M HBase requests (1/4 reads, 3/4
creates or updates), about 160K failed with timeout exceptions. We need
help troubleshooting what is going on. Here is some basic info:

1. I am using HBase 1.0.0 with Cloudera CDH5.3.1. Key client settings are:
a. Client retries number set to 2 (yes, it is rather tight!)
b. Default pause between retries (100 msec I think)
c. RPC timeout is set to 500 (500 msec)
d. IPC socket timeout is default
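
For reference, the settings in item 1 would look roughly like the
following client-side hbase-site.xml fragment. This is only a sketch: it
assumes the standard property names (hbase.client.retries.number,
hbase.client.pause, hbase.rpc.timeout) and that the values are set through
the site file rather than in code. The IPC socket timeout is omitted
because it is left at its default.

```xml
<configuration>
  <!-- a. Number of client retries before giving up (deliberately tight) -->
  <property>
    <name>hbase.client.retries.number</name>
    <value>2</value>
  </property>
  <!-- b. Base pause between retries, in milliseconds (believed default) -->
  <property>
    <name>hbase.client.pause</name>
    <value>100</value>
  </property>
  <!-- c. RPC timeout, in milliseconds -->
  <property>
    <name>hbase.rpc.timeout</name>
    <value>500</value>
  </property>
</configuration>
```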

2. The errors were intermittent in nature: sometimes one or two requests
were affected, sometimes a batch of tens or hundreds of requests spanning
several seconds.
3. We have several HBase clients running in a load-balanced manner.
Sometimes all HBase clients have some failures, but at other times only
one of them has a problem accessing HBase for a second or a few seconds.
At other times, another node has issues accessing HBase.
4. The error message:

23-Sep-2015 23:18:37.826 SEVERE [AsyncFileHandlerWriter-1414644648]
com.... ::
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
1 action: IOException: 1 time

To analyze the problem, the first step is to add code to the client to
dump out more info for debugging. So my questions on configuring/coding
the HBase client are:
1. How can I print the region server that the client was trying to
connect to?
2. How can I print the specific error that caused the exception?
3. Looking at netstat on the client side, we also found two to three
thousand sockets in TIME_WAIT state (which means the client is closing
the sockets). Is this to be expected? I would expect a socket between the
client and the Region Server to be long-lived, and thus the number of
sockets in TIME_WAIT state to be rather small during normal operation.
Is my understanding correct?
4. If my understanding in #3 is correct, what could be the underlying
reason for the large number of sockets in TIME_WAIT state?
5. I would like to gain a better understanding of the underlying HBase
connection behavior: RPC, socket, retry, etc. Besides going through the
HBase source code, are there any resources available (blog, jira, etc.)?
For the source code, is the relevant code in
src/main/java/org/apache/hadoop/hbase/client/?

Any pointers or suggestions are appreciated!

Thanks a lot in advance!

-cw-
