Hi, this is my first time posting a question in this forum. Please pardon me (and let me know) if I am doing something improper :-) Thanks!
While running a load test of about 25M HBase requests (1/4 reads, 3/4 creates or updates), about 160K requests failed with timeout exceptions. We need help troubleshooting what is going on. Here is some basic info:

1. I am using HBase 1.0.0 with Cloudera CDH5.3.1. Key client settings are:
   a. Client retries number is set to 2 (yes, it is rather tight!)
   b. Default pause between retries (100 msec, I think)
   c. RPC timeout is set to 500 (500 msec)
   d. IPC socket timeout is the default
2. The errors were intermittent in nature: sometimes only one or two requests were affected, sometimes a batch of tens or hundreds of requests spanning several seconds.
3. We have several HBase clients running in a load-balanced manner. Sometimes all HBase clients have some failures, but at other times only one of them has problems accessing HBase for a second or a few seconds. At other times, a different node has issues accessing HBase.
4. The error message:

23-Sep-2015 23:18:37.826 SEVERE [AsyncFileHandlerWriter-1414644648] com.... :: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: IOException: 1 time

To analyze the problem, the first step is to add code to the client to dump more debugging info. So my questions about configuring/coding the HBase client are:

1. How can I print the region server that the client was trying to connect to?
2. How can I print the specific error that caused the exception?
3. Looking at netstat on the client side, we also found two to three thousand sockets in TIME_WAIT state (which means the client is closing the sockets). Is this to be expected? I would expect a socket between the client and the region server to be long-lived, so the number of sockets in TIME_WAIT state should be rather small during normal operation. Is my understanding correct?
4. If #3 is true, what could be the underlying reason for the large number of sockets in TIME_WAIT state?
5. I would like to gain a better understanding of the underlying HBase connection behavior: RPC, sockets, retries, etc. Besides going through the HBase source code, are there any resources available (blogs, JIRAs, etc.)? For the source code, is the relevant code in src/main/java/org/apache/hadoop/hbase/client/?

Any pointers or suggestions are appreciated! Thanks a lot in advance!

-cw-
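For context on how tight the client settings in item 1 are, below is a rough back-of-the-envelope sketch in plain Java (no HBase classes). It assumes that every attempt runs into the full 500 msec RPC timeout and that the client sleeps a fixed 100 msec between attempts. Both are simplifications: I am not sure whether a retries number of 2 means two or three attempts in total, and the real client may scale the pause between retries. The names RetryBudget and worstCaseMillis are made up for this illustration.

```java
// Back-of-the-envelope model only; this is NOT how the HBase client
// computes its timing. All names here are hypothetical.
class RetryBudget {

    /**
     * Rough upper bound (in msec) on the time spent on one request before
     * the client gives up, ASSUMING every attempt hits the full RPC timeout
     * and a fixed pause separates consecutive attempts.
     */
    static long worstCaseMillis(int attempts, long rpcTimeoutMs, long pauseMs) {
        if (attempts < 1) {
            throw new IllegalArgumentException("attempts must be >= 1");
        }
        // attempts timeouts, with (attempts - 1) pauses in between
        return attempts * rpcTimeoutMs + (attempts - 1) * pauseMs;
    }

    public static void main(String[] args) {
        long rpcTimeoutMs = 500; // RPC timeout from item 1c
        long pauseMs = 100;      // pause between retries from item 1b

        // Show both readings of "retries number = 2" (assumption: 2 or 3 attempts).
        for (int attempts = 2; attempts <= 3; attempts++) {
            System.out.println(attempts + " attempts: up to ~"
                    + worstCaseMillis(attempts, rpcTimeoutMs, pauseMs) + " msec");
        }
        // Prints:
        // 2 attempts: up to ~1100 msec
        // 3 attempts: up to ~1700 msec
    }
}
```

If this rough model is anywhere near right, a region server that stalls for more than a second or two would be enough to exhaust the retries, which might fit the bursts described in item 2.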
