Are you writing fat cells? Have you tried raising the heap size to see whether it still crashes?
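
If you do bump the heap, one way is through HBASE_REGIONSERVER_OPTS (or HBASE_HEAPSIZE) in conf/hbase-env.sh. A minimal sketch, assuming the stock script; the 12g figure and the CMS flags are only illustrative, not a tuned recommendation:

    # conf/hbase-env.sh -- illustrative values only, size to what the machine can spare
    export HBASE_REGIONSERVER_OPTS="-Xmx12g -Xms12g \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70"

Starting CMS earlier (occupancy fraction around 70) gives the collector headroom before the old generation fills, which can shorten the kind of stop-the-world pauses shown in the logs below.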
Regards
Ram

On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <[email protected]> wrote:

> So I'm looking at ganglia, so the numbers are somewhat approximate (this is
> for a server that just crashed about half an hour ago due to running out of
> memory):
>
> Store files are hovering just below 1k. Over the last 24 hours it has varied
> by about 100 files (I'm looking at hbase.regionserver.storefiles).
>
> Block cache count is about 24k, varying by about 2k. Our block cache free
> goes between 0.7G and 0.4G. It looks like we have almost 3G free after
> restarting a region server.
>
> The evicted block count went from 210k to 320k over a 24 hour period. Hit
> ratio is close to 100 (the graph isn't very detailed, so I'm guessing it is
> around 98-99%).
>
> Block cache size stays at about 2GB.
>
> ~Jeff
>
> On 10/30/2012 6:21 PM, Jeff Whiting wrote:
>
>> We have no coprocessors. We are running replication from this cluster to
>> another one.
>>
>> What is the best way to see how many store files we have? Or to check on
>> the block cache?
>>
>> ~Jeff
>>
>> On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
>>
>>> Hi
>>>
>>> Are you using any coprocessors? Can you see how many store files are
>>> created?
>>>
>>> The number of blocks getting cached will give you an idea too.
>>>
>>> Regards
>>> Ram
>>>
>>> On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[email protected]> wrote:
>>>
>>>> We have 6 region servers, each given 10G of memory for HBase. Each region
>>>> server has an average of about 100 regions, and across the cluster we are
>>>> averaging about 100 requests/second with a pretty even read/write load.
>>>> We are running cdh4 (0.92.1-cdh4.0.1, rUnknown).
>>>>
>>>> Looking over our load and our requests, I feel that the 10GB of memory
>>>> should be enough to handle the load and that we shouldn't really be
>>>> pushing the memory limits.
>>>>
>>>> However, what we are seeing is that our memory usage goes up slowly until
>>>> the region server starts sputtering due to GC issues and it will
>>>> eventually get timed out by ZooKeeper and be killed.
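
Side note: GC logging would show exactly which collections line up with the Sleeper warnings below. A minimal, illustrative addition to conf/hbase-env.sh; the log path is just an example:

    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/regionserver-gc.log"

Repeated "concurrent mode failure" or "promotion failed" entries in that log would mean the old generation is filling faster than CMS can reclaim it.
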
>>>> We'll see aborts like this in the log:
>>>>
>>>> 2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: Unhandled
>>>> exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>>> rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547 as
>>>> dead server
>>>> 2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>> 2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
>>>> regionserver:60020-0x13959edd45934cf regionserver:60020-0x13959edd45934cf
>>>> received expired from ZooKeeper, aborting
>>>> 2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>>
>>>> Which are "caused" by:
>>>>
>>>> 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 29014ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 28121ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 31124ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 32209ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 32557ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 33741ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>
>>>> We'll also see a bunch of responseTooSlow and operationTooSlow warnings as
>>>> GC kicks in and really kills the region server's performance.
>>>>
>>>> We have the JVM metrics going out to ganglia, and looking at
>>>> jvm.RegionServer.metrics.memHeapUsedM you can see that it goes up over time
>>>> until the server eventually runs out of memory. I can also see in
>>>> hmaster:60010/master-status that usedHeapMB just goes up, and I can make a
>>>> pretty educated guess as to which server will go down next. It takes
>>>> several days to a week of continuous running (after restarting a region
>>>> server) before we have a potential problem.
>>>>
>>>> Our next one to go will probably be ds6, and jmap -heap shows:
>>>>
>>>> concurrent mark-sweep generation:
>>>>    capacity = 10398531584 (9916.8125MB)
>>>>    used     = 9036165000 (8617.558479309082MB)
>>>>    free     = 1362366584 (1299.254020690918MB)
>>>>    86.89847145248619% used
>>>>
>>>> So we are using 86% of the 10GB heap allocated to the concurrent mark and
>>>> sweep generation. Looking at ds6 in the web interface, which shows
>>>> information about its tasks, it isn't doing any RPC work and doesn't show
>>>> any compactions or other background tasks happening. Nor are there any
>>>> active RPC calls longer than 0 seconds (it seems to be handling the
>>>> requests just fine).
>>>>
>>>> At this point I feel somewhat lost as to how to debug the problem. I'm not
>>>> sure what to do next to figure out what is going on. Any suggestions as to
>>>> what to look for, or how to debug where the memory is being used? I can
>>>> generate heap dumps via jmap (although it effectively kills the region
>>>> server), but I don't really know what to look for to see where the memory
>>>> is going. I also have JMX set up on each region server and can connect to
>>>> it that way.
>>>>
>>>> Thanks,
>>>> ~Jeff
>>>>
>>>> --
>>>> Jeff Whiting
>>>> Qualtrics Senior Software Engineer
>>>> [email protected]
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> [email protected]
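
On the jmap question above: two standard invocations can help narrow down where the heap is going. The PID and output path below are placeholders; -histo:live is much cheaper than a full dump, though both pause the JVM while they run.

    # class histogram of live objects -- a quick first look at what dominates the heap
    jmap -histo:live <regionserver-pid> | head -40

    # full heap dump for offline analysis in a tool such as Eclipse MAT or jhat
    jmap -dump:live,format=b,file=/tmp/regionserver-heap.hprof <regionserver-pid>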
