Hi Ram,

Thanks for your help. We seem to have resolved the issue today by making application changes (refer to my other reply on this thread).
Regards,
Srikanth

-----Original Message-----
From: Ramkrishna S Vasudevan [mailto:[email protected]]
Sent: Monday, August 22, 2011 11:56 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Hi Srikanth,

I went through your logs; there is not much info in them. Could you share the logs that show what happened when the master and RS started, and also to which server the ROOT and META tables got assigned?

Regards,
Ram

-----Original Message-----
From: Srikanth P. Shreenivas [mailto:[email protected]]
Sent: Saturday, August 20, 2011 6:27 PM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Further in this investigation, we enabled the debug logs on the client side. We are observing that the client is trying to locate the root region and is continuously failing to do so. The logs are filled with entries like this:

2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null

The client keeps retrying until its retries are exhausted. Complete logs are available here: https://gist.github.com/1159064, including logs of the master, ZooKeeper, and the region servers.

If you could please look at the logs and provide some input on this issue, it would be really helpful. We are really not sure why the client is failing to get the root region from the server. Any guidance will be greatly appreciated.

Thanks a lot,
Srikanth

-----Original Message-----
From: Srikanth P. Shreenivas
Sent: Saturday, August 20, 2011 1:57 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

I did some tests today. In our QA setup, we don't see any issues. I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected.

However, in our production setup, I regularly see the issue wherein the client thread gets interrupted because the HTable.get() call does not return. It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121. However, I am not able to figure out what is causing this. The logs of the HBase as well as the Hadoop servers seem quite normal.

The cluster on which we are seeing this issue has no writes happening, and I do see this issue after about 10 operations. One strange thing I noticed, though, is that the ZooKeeper logs were truncated and had entries only for the last 15-20 minutes instead of the complete day. This was around 8 PM, so the log had not rolled over.

I had asked a query about the same issue earlier (http://www.mail-archive.com/[email protected]/msg09904.html). The changes we made then, moving to CDH3 and using -XX:+UseMembar, fixed the issues in our QA setup. However, the same changes in production do not seem to have had a similar effect.

If you can provide any clues on how we should go about investigating this issue, that would be a real help.
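As a sanity check on the 10-minute figure, here is a rough back-of-the-envelope sketch (assuming the default of 10 client retries, the default hbase.client.pause of 1000 ms, and a backoff multiplier table like {1, 1, 1, 2, 2, 4, 4, 8, 16, 32}; the exact values may differ in the CDH3 build) of how long a fully exhausted retry loop should sleep in total:

// Rough estimate of the total sleep a fully exhausted HBase client retry
// loop would accumulate. The pause and multiplier values below are assumed
// defaults for illustration, not values read from our cluster configuration.
public class RetrySleepEstimate {
    public static void main(String[] args) {
        long pauseMs = 1000L;                              // assumed hbase.client.pause
        int[] backoff = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32};  // assumed backoff multipliers
        int retries = 10;                                  // assumed hbase.client.retries.number
        long totalMs = 0;
        for (int attempt = 0; attempt < retries; attempt++) {
            int multiplier = backoff[Math.min(attempt, backoff.length - 1)];
            totalMs += pauseMs * multiplier;
        }
        // With these assumed values this prints roughly 71000 ms.
        System.out.println("Total sleep across " + retries + " retries: " + totalMs + " ms");
    }
}

With these assumed values the total comes out on the order of a minute, nowhere near the 10-minute delays we are seeing.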
Regards,
Srikanth

________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:39 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

My apologies, I may not be reading the code right. You are right, it is the GridGain timeout that is causing line 1255 to execute.

However, the question is what would make an HTable.get() take close to 10 minutes and thereby induce a timeout in the GridGain task. The value of numRetries at line 1236 should be 10 (the default), and if we go with the default value of HConstants.RETRY_BACKOFF, the sleep time added up across all retries will be only about 61 seconds, not close to the 600 seconds we are seeing in our code.

Regards,
Srikanth

________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from this file:
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java

________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply. It's always comforting to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool. In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete. I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing the HTable.get. My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master. For line 1255 to execute, we would have to assume that all retries were exhausted. Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries reaches line 1255?
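One way to at least confirm whether the interrupt originates inside the HBase client (rather than in the container) would be to wrap the get and inspect the thread's interrupt flag when it fails. A rough sketch, with a made-up table name and row key rather than our actual schema:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GuardedGet {
    // Performs the get and reports whether the calling thread was interrupted
    // when the get failed; rethrows the original exception either way.
    public static Result guardedGet(HTable table, byte[] row) throws IOException {
        try {
            return table.get(new Get(row));
        } catch (IOException e) {
            // If the HBase client interrupted this thread after exhausting its
            // retries, the interrupt flag will still be set here. Log it and
            // clear it so the worker thread can still report back.
            boolean wasInterrupted = Thread.interrupted();  // also clears the flag
            System.err.println("get failed; thread interrupted = " + wasInterrupted);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "auth_table");        // "auth_table" is a placeholder name
        Result r = guardedGet(table, Bytes.toBytes("some-row-key"));
        System.out.println("row has " + r.size() + " cells");
        table.close();
    }
}

Thread.interrupted() both reports and clears the flag, so the worker thread would at least be able to report the failure instead of going silent.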
Regards,
Srikanth

________________________________________
From: [email protected] [[email protected]] on behalf of Stack [[email protected]]
Sent: Friday, August 19, 2011 12:03 AM
To: [email protected]
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form, and could the container be doing the interrupting? I've not come across client-side thread interrupts before.

St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[email protected]> wrote:
> Hi,
>
> We are experiencing an issue in our HBase cluster wherein some of the gets are timing out at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
> When we look at the logs of the master, ZooKeeper, and the region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up the functions below, but at this point could not make much out of them.
>
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java - HTable.get starts at Line 611
>
> Could you please suggest the scenarios in which all retries can get exhausted, resulting in thread interruption?
>
> We have seen this issue in two of our HBase clusters, where the load is quite low: about 20 reads per minute. We run 1 ZooKeeper and 4 region servers in fully-distributed mode (Hadoop). We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>
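For reference, the client-side settings that bound how many times, and for how long, getRegionServerWithRetries will retry before giving up are the retry count and the base pause. A minimal sketch of setting them explicitly when creating the table (the table name is a placeholder and the values shown are just the usual defaults, not a recommendation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientRetryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Number of times the client retries before giving up (default 10).
        conf.setInt("hbase.client.retries.number", 10);
        // Base pause in milliseconds between retries; the backoff table multiplies this.
        conf.setLong("hbase.client.pause", 1000L);
        // An HTable built from this configuration uses the settings above.
        HTable table = new HTable(conf, "auth_table");  // placeholder table name
        table.close();
    }
}

These are the knobs behind the "attempt=0 of 10" and "sleep of 1000" figures in the debug log earlier in this thread.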
