Doug, St.Ack,

We changed our production setup to CDH3 to resolve the issue mentioned below. I noticed that even though the servers were running JDK 1.6u25 (as per JAVA_HOME in hbase-env.sh), I still ran into reads taking more than a minute. So, I have added -XX:+UseMembar and it seems to be okay after that.
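For reference, a minimal sketch of what the change amounts to, assuming the flag is appended to HBASE_OPTS in conf/hbase-env.sh on the region servers (the exact file and any existing options will differ per setup):

  # conf/hbase-env.sh: append the flag to whatever HBASE_OPTS already contains
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseMembar"

The JVMs have to be restarted for the flag to take effect.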
We have a parallel system that is still running CDH2. I added -XX:+UseMembar there too, as it had the same issue, only worse. It too seems to be a lot more stable now.

Regards,
Srikanth

-----Original Message-----
From: Doug Meil [mailto:[email protected]]
Sent: Friday, July 15, 2011 9:06 PM
To: [email protected]
Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments

Glad to hear things are better Srikanth.  I'll add that to the
Troubleshooting chapter too to make it a little more obvious.

On 7/15/11 11:30 AM, "Srikanth P. Shreenivas"
<[email protected]> wrote:

>Hi St.Ack,
>
>I stumbled upon http://hbase.apache.org/book.html#d730e4957 in one of the
>other mail threads on the HBase user mailing list.
>
>We realized that we were running JVM 1.6.0_20-b02, and hence we tried
>adding -XX:+UseMembar as suggested in the above-mentioned FAQ.
>This seems to have resolved the issue. I ran the test app for 20 minutes
>with no read timeouts.
>
>Thanks for all the help.
>
>Regards,
>Srikanth
>
>
>-----Original Message-----
>From: Srikanth P. Shreenivas
>Sent: Sunday, July 10, 2011 5:20 PM
>To: [email protected]
>Subject: RE: HBase Read and Write Issues in Mutlithreaded Environments
>
>Hi St.Ack,
>
>I noticed that one of the region server machines had its clock running
>one day in the future.
>I corrected the date. I ran into some issues after restarting: I was
>getting errors with respect to .META. and such, which I did not understand
>much. Also, the status command in the hbase shell was displaying
>"3 servers, 1 dead" whereas I had only 3 region servers.
>
>So, I cleaned out "/hbase" (to get back to the real problem) and restarted
>the HBase nodes.
>
>After starting all 3 HBase nodes, I ran the test app again and observed
>the log files of all 3 region servers.
>I noticed that when the test app seemed hung, the web app's thread that
>was serving the request had gone to sleep at the code below. I think it
>stayed like that for around 10 minutes before Tomcat probably interrupted
>it.
>
>Thread-#8 - Thread t@29
>   java.lang.Thread.State: TIMED_WAITING
>        at java.lang.Thread.sleep(Native Method)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:791)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:589)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
>        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
>        at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:514)
>        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:133)
>        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:95)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:648)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:702)
>        - locked java.lang.Object@75826e08
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:593)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:564)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:415)
>        at org.apache.hadoop.hbase.client.ServerCallable.instantiateServer(ServerCallable.java:57)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1002)
>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>        <.. app specific trace removed ...>
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
>
>============================================================================
>After 10 minutes, the web app log showed:
>2011-07-10 16:50:28,804 [Thread-#8] ERROR [persistence.handler.HBaseHandler] - Exception occurred in searchData:
>java.io.IOException: Giving up trying to get region server: thread is interrupted.
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>
>============================================================================
>I did not see anything happening on the region server either; the log had
>only occasional entries like these:
>
>2011-07-10 16:43:53,648 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
>2011-07-10 16:48:53,649 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
>2011-07-10 16:53:53,648 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=6.52 MB, free=788.08 MB, max=794.6 MB, blocks=0, accesses=1080, hits=0, hitRatio=0.00%%, cachingAccesses=0, cachingHits=0, cachingHitsRatio=�%, evictions=0, evicted=0, evictedPerRun=NaN
>
>Regards,
>Srikanth
>
>
>-----Original Message-----
>From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>Sent: Saturday, July 09, 2011 9:41 PM
>To: [email protected]
>Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments
>
>You read the requirements section in our docs and you have upped the
>ulimits, nprocs, etc.? http://hbase.apache.org/book/os.html
>
>If you know the row, can you deduce the regionserver it's talking to?
>(Below is the client failure -- we need to figure out what's up on the
>server side.) Once you've done that, can you check its logs? See if
>you can figure out anything on why the hang?
>
>Thanks,
>St.Ack
>
>On Sat, Jul 9, 2011 at 6:14 AM, Srikanth P. Shreenivas
><[email protected]> wrote:
>> Hi St.Ack,
>>
>> We upgraded to CDH3 (hadoop-0.20-0.20.2+923.21-1.noarch.rpm,
>> hadoop-hbase-0.90.1+15.18-1.noarch.rpm,
>> hadoop-zookeeper-3.3.3+12.1-1.noarch.rpm).
>>
>> I ran the same test which I was running against the app when it was
>> running on CDH2. The test app posts a request to the web app every
>> 100 ms, and the web app reads an HBase record, performs some logic, and
>> saves an audit trail by writing another HBase record.
>>
>> When our app was running on CDH2, I observed the issue below once every
>> 10 to 15 requests.
>> With CDH3, that issue is not happening at all. So it seems the situation
>> has improved a lot, and our app seems to be a lot more stable.
>>
>> However, I am still seeing one issue. There are still requests
>> (around 1%) which are not able to read the record from HBase, and
>> the get call hangs for almost 10 minutes. This is what I see in the
>> application log:
>>
>> 2011-07-09 18:27:25,537 [gridgain-#6%authGrid%] ERROR [my.app.HBaseHandler] - Exception occurred in searchData:
>> java.io.IOException: Giving up trying to get region server: thread is interrupted.
>>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1016)
>>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:546)
>>
>>         <...app specific trace removed...>
>>
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>         at org.gridgain.grid.util.runnable.GridRunnable.run(GridRunnable.java:194)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>>
>> I am running the test on the same record, so all my "get" calls are for
>> the same row id.
>>
>> It will be of immense help if you can provide some inputs on whether we
>> are missing some configuration settings, or whether there is a way to
>> get around this.
>>
>> Thanks,
>> Srikanth
>>
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>> Sent: Wednesday, June 29, 2011 7:48 PM
>> To: [email protected]
>> Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments
>>
>> Go to CDH3 if you can.  CDH2 is also old.
>> St.Ack
>>
>> On Wed, Jun 29, 2011 at 7:15 AM, Srikanth P. Shreenivas
>> <[email protected]> wrote:
>>> Thanks, St.Ack, for the inputs.
>>>
>>> Will upgrading to CDH3 help, or is there a version within CDH2 that you
>>> recommend we should upgrade to?
>>>
>>> Regards,
>>> Srikanth
>>>
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
>>> Sent: Wednesday, June 29, 2011 11:16 AM
>>> To: [email protected]
>>> Subject: Re: HBase Read and Write Issues in Mutlithreaded Environments
>>>
>>> Can you upgrade?  That release is > 18 months old.  A bunch has
>>> happened in the meantime.
>>>
>>> For retries exhausted, check what's going on on the remote regionserver
>>> that you are trying to write to.  It's probably struggling and that's
>>> why requests are not going through -- or the client missed the fact
>>> that the region moved (all stuff that should be working better in the
>>> latest hbase).
>>>
>>> St.Ack
>>>
>>> On Tue, Jun 28, 2011 at 9:51 PM, Srikanth P. Shreenivas
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> We are using an HBase 0.20.3 (hbase-0.20-0.20.3-1.cloudera.noarch.rpm)
>>>> cluster in distributed mode with Hadoop 0.20.2
>>>> (hadoop-0.20-0.20.2+320-1.noarch).
>>>> We are using pretty much the default configuration; the only thing we
>>>> have customized is that we have allocated 4GB of RAM in
>>>> /etc/hbase-0.20/conf/hbase-env.sh
>>>>
>>>> In our setup, we have a web application that reads a record from
>>>> HBase and writes a record as part of each web request. The
>>>> application is hosted in Apache Tomcat 7 and is a stateless web
>>>> application providing a REST-like web service API.
>>>>
>>>> We are observing that our reads and writes time out once in a
>>>> while. This happens more for writes.
>>>> We see the exceptions below in our application logs:
>>>>
>>>>
>>>> Exception Type 1 - During Get:
>>>> ---------------------------------------
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.1.68.36:60020 for region employeedata,be8784ac8b57c45625a03d52be981b88097c2fdc,1308657957879, row 'd51b74eb05e07f96cee0ec556f5d8d161e3281f3', but failed after 10 attempts.
>>>> Exceptions:
>>>> java.io.IOException: Call to /10.1.68.36:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>> java.nio.channels.ClosedByInterruptException
>>>>
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>>>>         at org.apache.hadoop.hbase.client.HTable.get(HTable.java:417)
>>>>         <snip>
>>>>
>>>> Exception Type 2 - During Put:
>>>> ---------------------------------------------
>>>> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.1.68.34:60020 for region audittable,,1309183872019, row '2a012017120f80a801b28f5f66a83dc2a8882d1b', but failed after 10 attempts.
>>>> Exceptions:
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>> java.io.IOException: Call to /10.1.68.34:60020 failed on local exception: java.nio.channels.ClosedByInterruptException
>>>>
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1048)
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$3.doCall(HConnectionManager.java:1239)
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1161)
>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474)
>>>>         <snip>
>>>>
>>>> Any inputs on why this is happening, or how to rectify it, will be of
>>>> immense help.
>>>>
>>>> Thanks,
>>>> Srikanth
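For context, the per-request access pattern described in the original mail (one Get for the record, one Put for the audit trail) looks roughly like the sketch below against the 0.90-era client API. It is only an illustration: the column family and qualifier names are made up, the table names are borrowed from the exceptions above, and an HTablePool is used because a single HTable instance is not safe for concurrent use from multiple request threads.

// Illustrative sketch only; "d", "value", and "detail" are hypothetical names.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRequestHandler {
    private final Configuration conf = HBaseConfiguration.create();
    // One pool shared by all request threads; HTable itself is not thread-safe.
    private final HTablePool pool = new HTablePool(conf, 50);

    public byte[] handleRequest(String rowKey, String auditRowKey) throws IOException {
        HTableInterface data = pool.getTable("employeedata");
        HTableInterface audit = pool.getTable("audittable");
        try {
            // Read the record for this request.
            Result r = data.get(new Get(Bytes.toBytes(rowKey)));
            byte[] value = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"));

            // ... application logic ...

            // Write the audit-trail record.
            Put p = new Put(Bytes.toBytes(auditRowKey));
            p.add(Bytes.toBytes("d"), Bytes.toBytes("detail"), Bytes.toBytes("processed " + rowKey));
            audit.put(p);
            return value;
        } finally {
            // Return the tables to the pool (0.90-era HTablePool API).
            pool.putTable(data);
            pool.putTable(audit);
        }
    }
}

The ClosedByInterruptException entries in the traces above are what this kind of client code reports when the calling thread is interrupted (here, apparently by the container or grid framework timing the request out) while a get or put is still blocked in retries.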
