Hi Ram,

Thanks for your help. We seem to have resolved the issue today by making application changes (refer to my other reply on this thread).
Regards,
Srikanth

-----Original Message-----
From: Ramkrishna S Vasudevan [mailto:[email protected]]
Sent: Monday, August 22, 2011 11:56 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Hi Srikanth,

I went through your logs; there is not much info in them. Could you share the logs that show what happened when the master and RS started, and also to which server the ROOT and META tables got assigned?

Regards,
Ram

-----Original Message-----
From: Srikanth P. Shreenivas [mailto:[email protected]]
Sent: Saturday, August 20, 2011 6:27 PM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Further in this investigation, we enabled the debug logs on the client side. We are observing that the client is trying to locate the root region and is continuously failing to do so. The logs are filled with entries like this:

2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@2cc25ae3; hsa=DC1AuthDFSC1D3.cidr.gov.in:6020
2011-08-20 17:20:09,092 [gridgain-#6%authGrid%] DEBUG [hbase.client.HConnectionManager$HConnectionImplementation] - locateRegionInMeta parentTable=-ROOT-, metaLocation=address: DC1AuthDFSC1D3.cidr.gov.in:6020, regioninfo: -ROOT-,,0.70236052, attempt=0 of 10 failed; retrying after sleep of 1000 because: null

The client keeps retrying until its retries are exhausted. Complete logs are available here: https://gist.github.com/1159064, including logs of the master, ZooKeeper, and the region servers.

If you could please look at the logs and provide some input on this issue, it would be really helpful. We are really not sure why the client is failing to get the root region from the server. Any guidance will be greatly appreciated.

Thanks a lot,
Srikanth

-----Original Message-----
From: Srikanth P. Shreenivas
Sent: Saturday, August 20, 2011 1:57 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

I did some tests today. In our QA setup, we don't see any issues. I ran more than 100,000 operations in our QA setup in 1 hour with all HBase reads/writes working as expected.

However, in our production setup, I regularly see the issue wherein the client thread gets interrupted because the HTable.get() call does not return. It is possible that it is taking more than 10 minutes due to https://issues.apache.org/jira/browse/HBASE-2121. However, I am not able to figure out what is causing this. The logs of the HBase as well as the Hadoop servers seem quite normal.

The cluster on which we are seeing this issue has no writes happening, and I do see this issue after about 10 operations. One strange thing I noticed, though, is that the ZooKeeper logs were truncated and had entries only for the last 15-20 minutes instead of the complete day. This was around 8 PM, so the log had not rolled over.

I had asked a query about the same issue earlier (http://www.mail-archive.com/[email protected]/msg09904.html). The changes we made then, moving to CDH3 and using -XX:+UseMembar, fixed the issues in our QA setup. However, the same changes in production do not seem to have had a similar effect.

If you can provide any clues on how we should go about investigating this issue, that would be a real help.
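As a sanity check on the 10-minute figure, here is a rough back-of-the-envelope sketch (assuming the default of 10 client retries, the default hbase.client.pause of 1000 ms, and a backoff multiplier table like {1, 1, 1, 2, 2, 4, 4, 8, 16, 32}; the exact values may differ in the CDH3 build) of how long a fully exhausted retry loop should sleep in total:

// Rough estimate of the total sleep a fully exhausted HBase client retry
// loop would accumulate. The pause and multiplier values below are assumed
// defaults for illustration, not values read from our cluster configuration.
public class RetrySleepEstimate {
    public static void main(String[] args) {
        long pauseMs = 1000L;                              // assumed hbase.client.pause
        int[] backoff = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32};  // assumed backoff multipliers
        int retries = 10;                                  // assumed hbase.client.retries.number
        long totalMs = 0;
        for (int attempt = 0; attempt < retries; attempt++) {
            int multiplier = backoff[Math.min(attempt, backoff.length - 1)];
            totalMs += pauseMs * multiplier;
        }
        // With these assumed values this prints roughly 71000 ms.
        System.out.println("Total sleep across " + retries + " retries: " + totalMs + " ms");
    }
}

With these assumed values the total comes out on the order of a minute, nowhere near the 10-minute delays we are seeing.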
Regards,
Srikanth

________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:39 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

My apologies, I may not be reading the code right. You are right, it is the GridGain timeout that is causing line 1255 to execute.

However, the question is what would make an HTable.get() take close to 10 minutes and thereby induce a timeout in the GridGain task. The value of numRetries at line 1236 should be 10 (the default), and if we go with the default value of HConstants.RETRY_BACKOFF, the sleep time added up across all retries will be only about 61 seconds, not close to the 600 seconds we are seeing in our code.

Regards,
Srikanth

________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:21 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Please note that the line numbers I am referencing are from this file:
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java

________________________________________
From: Srikanth P. Shreenivas
Sent: Friday, August 19, 2011 12:19 AM
To: [email protected]
Subject: RE: Query regarding HTable.get and timeouts

Hi Stack,

Thanks a lot for your reply. It's always comforting to see such an active community, and especially your prompt replies to queries.

Yes, I am running it as a GridGain task, so it runs in GridGain's thread pool. In this case, we can imagine GridGain as something that hands off work to various worker threads and waits asynchronously for it to complete. I have a 10-minute timeout after which GridGain considers the work timed out.

What we are observing is that our tasks are timing out at the 10-minute boundary, and the delay seems to be caused by the part of the work that is doing the HTable.get. My suspicion is that line 1255 in HConnectionManager.java is calling Thread.currentThread().interrupt(), due to which the GridGain thread stops doing what it was meant to do and never responds to the master node, resulting in a timeout on the master. For line 1255 to execute, we would have to assume that all retries were exhausted. Hence my query: what would cause an HTable.get() to get into a situation wherein HConnectionManager$HConnectionImplementation.getRegionServerWithRetries reaches line 1255?
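One way to at least confirm whether the interrupt originates inside the HBase client (rather than in the container) would be to wrap the get and inspect the thread's interrupt flag when it fails. A rough sketch, with a made-up table name and row key rather than our actual schema:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GuardedGet {
    // Performs the get and reports whether the calling thread was interrupted
    // when the get failed; rethrows the original exception either way.
    public static Result guardedGet(HTable table, byte[] row) throws IOException {
        try {
            return table.get(new Get(row));
        } catch (IOException e) {
            // If the HBase client interrupted this thread after exhausting its
            // retries, the interrupt flag will still be set here. Log it and
            // clear it so the worker thread can still report back.
            boolean wasInterrupted = Thread.interrupted();  // also clears the flag
            System.err.println("get failed; thread interrupted = " + wasInterrupted);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "auth_table");        // "auth_table" is a placeholder name
        Result r = guardedGet(table, Bytes.toBytes("some-row-key"));
        System.out.println("row has " + r.size() + " cells");
        table.close();
    }
}

Thread.interrupted() both reports and clears the flag, so the worker thread would at least be able to report the failure instead of going silent.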
Regards,
Srikanth

________________________________________
From: [email protected] [[email protected]] on behalf of Stack [[email protected]]
Sent: Friday, August 19, 2011 12:03 AM
To: [email protected]
Subject: Re: Query regarding HTable.get and timeouts

Is your client running inside a container of some form, and could the container be doing the interrupting? I've not come across client-side thread interrupts before.

St.Ack

On Thu, Aug 18, 2011 at 7:37 AM, Srikanth P. Shreenivas <[email protected]> wrote:
> Hi,
>
> We are experiencing an issue in our HBase cluster wherein some of the gets are timing out at:
>
> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
> When we look at the logs of the master, ZooKeeper, and the region servers, there is nothing that indicates anything abnormal.
>
> I tried looking up the functions below, but at this point could not make much out of them.
>
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java - getRegionServerWithRetries starts at Line 1233
> https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java - HTable.get starts at Line 611
>
> Could you please suggest the scenarios in which all retries can get exhausted, resulting in thread interruption?
>
> We have seen this issue in two of our HBase clusters, where the load is quite low: about 20 reads per minute. We run 1 ZooKeeper and 4 region servers in fully-distributed mode (Hadoop). We are using CDH3.
>
> Thanks,
> Srikanth
>
> ________________________________
>
> http://www.mindtree.com/email/disclaimer.html
>
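For reference, the client-side settings that bound how many times, and for how long, getRegionServerWithRetries will retry before giving up are the retry count and the base pause. A minimal sketch of setting them explicitly when creating the table (the table name is a placeholder and the values shown are just the usual defaults, not a recommendation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ClientRetryConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Number of times the client retries before giving up (default 10).
        conf.setInt("hbase.client.retries.number", 10);
        // Base pause in milliseconds between retries; the backoff table multiplies this.
        conf.setLong("hbase.client.pause", 1000L);
        // An HTable built from this configuration uses the settings above.
        HTable table = new HTable(conf, "auth_table");  // placeholder table name
        table.close();
    }
}

These are the knobs behind the "attempt=0 of 10" and "sleep of 1000" figures in the debug log earlier in this thread.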
