It follows an exponential backoff. Each pause is longer than the last one, and together they add up to close to 600 seconds.
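For illustration, here is a minimal sketch of how the retry pauses compound. The multiplier table, base pause, and retry count below are assumptions made for the arithmetic only; the real values come from HConstants.RETRY_BACKOFF, hbase.client.pause, and hbase.client.retries.number in the version you actually run, which is why the total can land anywhere from tens of seconds to roughly 600 seconds depending on configuration. (A second sketch, of the retry/interrupt pattern under discussion, follows the quoted thread below.)

// Minimal sketch with assumed values, not necessarily your cluster's defaults:
// total client-side sleep is roughly pause * sum(RETRY_BACKOFF[0..numRetries-1]).
public class BackoffSum {
  // Hypothetical multiplier table in the spirit of HConstants.RETRY_BACKOFF.
  static final int[] RETRY_BACKOFF = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32};
  static final long PAUSE_MS = 1000L;   // assumed hbase.client.pause
  static final int NUM_RETRIES = 10;    // assumed hbase.client.retries.number

  public static void main(String[] args) {
    long totalMs = 0;
    for (int tries = 0; tries < NUM_RETRIES; tries++) {
      int idx = Math.min(tries, RETRY_BACKOFF.length - 1);
      long sleepMs = PAUSE_MS * RETRY_BACKOFF[idx];
      totalMs += sleepMs;
      System.out.printf("attempt %d failed -> sleep %d ms (cumulative %d ms)%n",
          tries + 1, sleepMs, totalMs);
    }
    System.out.printf("total backoff across %d retries: ~%d s%n",
        NUM_RETRIES, totalMs / 1000);
  }
}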
On Thu, Aug 18, 2011 at 12:09 PM, Srikanth P. Shreenivas
<[email protected]> wrote:
> My apologies, I may not be reading the code right.
>
> You are right, it is the GridGain timeout that is making line 1255 execute.
> However, the question is what would make an HTable.get() take close to 10
> minutes and induce a timeout in the GridGain task.
>
> The value of numRetries at line 1236 should be 10 (the default), and if we
> go with the default value of HConstants.RETRY_BACKOFF, then the sleep time
> added across all retries will be only 61 seconds, not close to 600 seconds
> as is the case in our code.
>
> Regards,
> Srikanth
>
> ________________________________________
> From: Srikanth P. Shreenivas
> Sent: Friday, August 19, 2011 12:21 AM
> To: [email protected]
> Subject: RE: Query regarding HTable.get and timeouts
>
> Please note that the line numbers I am referencing are from the file:
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
>
> ________________________________________
> From: Srikanth P. Shreenivas
> Sent: Friday, August 19, 2011 12:19 AM
> To: [email protected]
> Subject: RE: Query regarding HTable.get and timeouts
>
> Hi Stack,
>
> Thanks a lot for your reply. It's always a comforting feeling to see such
> an active community, and especially your prompt replies to the queries.
>
> Yes, I am running it as a GridGain task, so it runs in GridGain's thread
> pool. In this case, we can imagine GridGain as something that hands off
> work to various worker threads and waits asynchronously for it to
> complete. I have a 10-minute timeout after which GridGain considers the
> work timed out.
>
> What we are observing is that our tasks are timing out at the 10-minute
> boundary, and the delay seems to be caused by the part of the work that is
> doing the HTable.get.
>
> My suspicion is that line 1255 in HConnectionManager.java is calling
> Thread.currentThread().interrupt(), due to which the GridGain thread more
> or less stops doing what it was meant to do and never responds to the
> master node, resulting in a timeout on the master.
>
> In order for line 1255 to execute, we have to assume that all retries were
> exhausted. Hence my query: what would cause an HTable.get() to get into a
> situation wherein
> HConnectionManager$HConnectionImplementation.getRegionServerWithRetries
> gets to line 1255?
>
> Regards,
> Srikanth
>
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Stack
> [[email protected]]
> Sent: Friday, August 19, 2011 12:03 AM
> To: [email protected]
> Subject: Re: Query regarding HTable.get and timeouts
>
> Is your client running inside a container of some form, and could the
> container be doing the interrupting? I've not come across client-side
> thread interrupts before.
> St.Ack
>
> On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas
> <[email protected]> wrote:
> > Hi,
> >
> > We are experiencing an issue in our HBase cluster wherein some of the
> > gets are timing out at:
> >
> > java.io.IOException: Giving up trying to get region server: thread is
> > interrupted.
> >         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
> >         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
> >
> > When we look at the logs of the master, ZooKeeper, and the region
> > servers, there is nothing that indicates anything abnormal.
> >
> > I tried looking up the functions below, but at this point could not make
> > much out of them:
> > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
> >   - getRegionServerWithRetries starts at line 1233
> > https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java
> >   - HTable.get starts at line 611
> >
> > Could you please suggest in what scenarios all retries can get exhausted,
> > resulting in the thread interruption?
> >
> > We have seen this issue in two of our HBase clusters, where the load is
> > quite light: about 20 reads per minute, 1 ZooKeeper, and 4 region servers
> > in fully-distributed mode (Hadoop). We are using CDH3.
> >
> > Thanks,
> > Srikanth
> >
> > ________________________________
> >
> > http://www.mindtree.com/email/disclaimer.html
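For reference, here is an illustrative sketch of the retry/interrupt pattern being discussed in the quoted thread. It is not the actual getRegionServerWithRetries code; the class name, helper name, and signature are invented for the example.

import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative only: a retry loop that sleeps between attempts and, if the
// sleeping thread is interrupted, restores the interrupt flag and gives up.
// A worker thread (e.g. a GridGain task thread) that returns from here with
// its interrupt status set may then be treated by its framework as cancelled
// or timed out, which matches the behaviour described above.
public class RetryInterruptSketch {
  static <T> T callWithRetries(Callable<T> call, int numRetries, long pauseMs)
      throws IOException {
    IOException last = null;
    for (int tries = 0; tries < numRetries; tries++) {
      try {
        return call.call();                      // the actual attempt, e.g. a get
      } catch (Exception e) {
        last = (e instanceof IOException) ? (IOException) e : new IOException(e);
      }
      if (tries + 1 < numRetries) {
        try {
          Thread.sleep(pauseMs);                 // back off before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();    // restore the flag for the caller
          throw new IOException(
              "Giving up trying to get region server: thread is interrupted.", last);
        }
      }
    }
    throw new IOException("retries exhausted", last);
  }
}

In this pattern, if something outside the client (a container, or a framework such as GridGain cancelling a task) interrupts the worker thread, the sleep fails immediately, the remaining retries are skipped, and the caller sees the "thread is interrupted" IOException from the stack trace above, which is in line with Stack's question about whether the container is doing the interrupting.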
