Are you writing fat cells? Have you tried raising the heap size to see whether it still crashes?
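
If you do bump the heap, one way is through HBASE_REGIONSERVER_OPTS (or HBASE_HEAPSIZE) in conf/hbase-env.sh. A minimal sketch, assuming the stock script; the 12g figure and the CMS flags are only illustrative, not a tuned recommendation:

    # conf/hbase-env.sh -- illustrative values only, size to what the machine can spare
    export HBASE_REGIONSERVER_OPTS="-Xmx12g -Xms12g \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70"

Starting CMS earlier (occupancy fraction around 70) gives the collector headroom before the old generation fills, which can shorten the kind of stop-the-world pauses shown in the logs below.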
Regards
Ram

On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <[email protected]> wrote:

> So I'm looking at ganglia, so the numbers are somewhat approximate (this is
> for a server that just crashed about half an hour ago due to running out of
> memory):
>
> Store files are hovering just below 1k. Over the last 24 hours it has varied
> by about 100 files (I'm looking at hbase.regionserver.storefiles).
>
> Block cache count is about 24k, varying by about 2k. Our block cache free
> goes between 0.7G and 0.4G. It looks like we have almost 3G free after
> restarting a region server.
>
> The evicted block count went from 210k to 320k over a 24 hour period. Hit
> ratio is close to 100 (the graph isn't very detailed, so I'm guessing it is
> around 98-99%).
>
> Block cache size stays at about 2GB.
>
> ~Jeff
>
> On 10/30/2012 6:21 PM, Jeff Whiting wrote:
>
>> We have no coprocessors. We are running replication from this cluster to
>> another one.
>>
>> What is the best way to see how many store files we have? Or to check on
>> the block cache?
>>
>> ~Jeff
>>
>> On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
>>
>>> Hi
>>>
>>> Are you using any coprocessors? Can you see how many store files are
>>> created?
>>>
>>> The number of blocks getting cached will give you an idea too.
>>>
>>> Regards
>>> Ram
>>>
>>> On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[email protected]> wrote:
>>>
>>>> We have 6 region servers, each given 10G of memory for HBase. Each region
>>>> server has an average of about 100 regions, and across the cluster we are
>>>> averaging about 100 requests/second with a pretty even read/write load.
>>>> We are running cdh4 (0.92.1-cdh4.0.1, rUnknown).
>>>>
>>>> Looking over our load and our requests, I feel that the 10GB of memory
>>>> should be enough to handle the load and that we shouldn't really be
>>>> pushing the memory limits.
>>>>
>>>> However, what we are seeing is that our memory usage goes up slowly until
>>>> the region server starts sputtering due to GC issues and it will
>>>> eventually get timed out by ZooKeeper and be killed.
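
Side note: GC logging would show exactly which collections line up with the Sleeper warnings below. A minimal, illustrative addition to conf/hbase-env.sh; the log path is just an example:

    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/regionserver-gc.log"

Repeated "concurrent mode failure" or "promotion failed" entries in that log would mean the old generation is filling faster than CMS can reclaim it.
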
>>>> We'll see aborts like this in the log:
>>>>
>>>> 2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: Unhandled
>>>> exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>>> rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547 as
>>>> dead server
>>>> 2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>> 2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
>>>> regionserver:60020-0x13959edd45934cf regionserver:60020-0x13959edd45934cf
>>>> received expired from ZooKeeper, aborting
>>>> 2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>>
>>>> Which are "caused" by:
>>>>
>>>> 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 29014ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 28121ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 31124ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 32209ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 32557ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
>>>> 33741ms instead of 3000ms, this is likely due to a long garbage collecting
>>>> pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>
>>>> We'll also see a bunch of responseTooSlow and operationTooSlow warnings as
>>>> GC kicks in and really kills the region server's performance.
>>>>
>>>> We have the JVM metrics going out to ganglia, and looking at
>>>> jvm.RegionServer.metrics.memHeapUsedM you can see that it goes up over time
>>>> until the server eventually runs out of memory. I can also see in
>>>> hmaster:60010/master-status that usedHeapMB just goes up, and I can make a
>>>> pretty educated guess as to which server will go down next. It takes
>>>> several days to a week of continuous running (after restarting a region
>>>> server) before we have a potential problem.
>>>>
>>>> Our next one to go will probably be ds6, and jmap -heap shows:
>>>>
>>>> concurrent mark-sweep generation:
>>>>    capacity = 10398531584 (9916.8125MB)
>>>>    used     = 9036165000 (8617.558479309082MB)
>>>>    free     = 1362366584 (1299.254020690918MB)
>>>>    86.89847145248619% used
>>>>
>>>> So we are using 86% of the 10GB heap allocated to the concurrent mark and
>>>> sweep generation. Looking at ds6 in the web interface, which shows
>>>> information about its tasks, it isn't doing any RPC work and doesn't show
>>>> any compactions or other background tasks happening. Nor are there any
>>>> active RPC calls longer than 0 seconds (it seems to be handling the
>>>> requests just fine).
>>>>
>>>> At this point I feel somewhat lost as to how to debug the problem. I'm not
>>>> sure what to do next to figure out what is going on. Any suggestions as to
>>>> what to look for, or how to debug where the memory is being used? I can
>>>> generate heap dumps via jmap (although it effectively kills the region
>>>> server), but I don't really know what to look for to see where the memory
>>>> is going. I also have JMX set up on each region server and can connect to
>>>> it that way.
>>>>
>>>> Thanks,
>>>> ~Jeff
>>>>
>>>> --
>>>> Jeff Whiting
>>>> Qualtrics Senior Software Engineer
>>>> [email protected]
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> [email protected]
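
On the jmap question above: two standard invocations can help narrow down where the heap is going. The PID and output path below are placeholders; -histo:live is much cheaper than a full dump, though both pause the JVM while they run.

    # class histogram of live objects -- a quick first look at what dominates the heap
    jmap -histo:live <regionserver-pid> | head -40

    # full heap dump for offline analysis in a tool such as Eclipse MAT or jhat
    jmap -dump:live,format=b,file=/tmp/regionserver-heap.hprof <regionserver-pid>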
