No fat rows. We have kept the default hbase client limit of 10mb, and most values are quite small (< 5k).
We haven't tried raising the memory limit; we can raise it on one of the servers and see how it does. However, looking at the graphs I don't think it will help... but it is worth a try.
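If we do bump it, I'm assuming it's just a matter of raising the heap in conf/hbase-env.sh on that one server and restarting it, roughly (12G here is only an example size):

    # conf/hbase-env.sh on the one server we test (value is in MB)
    export HBASE_HEAPSIZE=12288
    # or, if we only want the region server JVM larger:
    # export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms12g -Xmx12g"

and then restarting just that region server with bin/hbase-daemon.sh.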
~Jeff
On 10/30/2012 10:45 PM, ramkrishna vasudevan wrote:
Are you writing fat cells?
Did you try raising the heap size and seeing if it still crashes?
Regards
Ram
On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <[email protected]> wrote:
So I'm looking at ganglia, so the numbers are somewhat approximate (this is for a server that just crashed about half an hour ago due to running out of memory):

Store files are hovering just below 1k. Over the last 24 hours it has varied by about 100 files (I'm looking at hbase.regionserver.storefiles).

Block cache count is about 24k and has varied by about 2k. Our block cache free goes between 0.7G and 0.4G. It looks like we have almost 3G free after restarting a region server.

The evicted block count went from 210k to 320k over a 24 hour period. Hit ratio is close to 100 (the graph isn't very detailed so I'm guessing it is 98-99%). Block cache size stays at about 2GB.
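For numbers that aren't smoothed by ganglia, I'm assuming something like this would dump the per-regionserver storefile and block cache figures directly (the shell's 'status' command is the only thing assumed here):

    # print per-regionserver load, including storefile counts and block cache stats
    echo "status 'detailed'" | hbase shell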
~Jeff
On 10/30/2012 6:21 PM, Jeff Whiting wrote:
We have no coprocessors. We are running replication from this cluster to another one.

What is the best way to see how many store files we have, or to check on the block cache?
~Jeff
On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
Hi

Are you using any coprocessors? Can you see how many store files are created?
The number of blocks getting cached will give you an idea too.
Regards
Ram
On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[email protected]> wrote:
We have 6 region servers, each given 10G of memory for hbase. Each region server has an average of about 100 regions, and across the cluster we are averaging about 100 requests/second with a pretty even read/write load. We are running cdh4 (0.92.1-cdh4.0.1, rUnknown).

Looking over our load and our requests, I feel that the 10GB of memory should be enough to handle the load and that we shouldn't really be pushing the memory limits.

However, what we are seeing is that our memory usage goes up slowly until the region server starts sputtering due to GC issues, and it eventually gets timed out by zookeeper and is killed.
We'll see aborts like this in the log:
2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547 as dead server
2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received expired from ZooKeeper, aborting
2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
Which are "caused" by:
2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 29014ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 28121ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 31124ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32209ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32557ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 33741ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks in and really kills the region server's performance.
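In case it's useful, these are the GC logging flags we could enable in conf/hbase-env.sh to confirm the pauses are CMS full GCs / promotion failures rather than something else; these are the standard HotSpot flags, and the log path is just an example:

    # append standard HotSpot GC logging to the region server JVM options
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime \
      -Xloggc:/var/log/hbase/gc-regionserver.log"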
We have the jvm metrics kicking out to ganglia, and looking at jvm.RegionServer.metrics.memHeapUsedM you can see that it goes up over time and eventually runs out of memory. I can also see in hmaster:60010/master-status that usedHeapMB just goes up, and I can make a pretty educated guess as to which server will go down next. It takes several days to a week of continuous running (after restarting a region server) before we have a potential problem.
Our next one to go will probably be ds6 and jmap -heap shows:
concurrent mark-sweep generation:
capacity = 10398531584 (9916.8125MB)
used = 9036165000 (8617.558479309082MB)
free = 1362366584 (1299.254020690918MB)
86.89847145248619% used
So we are using 86% of the 10GB heap allocated to the concurrent mark-sweep generation. Looking at ds6 in the web interface, which shows the running tasks and RPC activity, it doesn't show any compactions or other background tasks happening. Nor are there any active RPC calls longer than 0 seconds (it seems to be handling the requests just fine).
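One more data point I can gather is how quickly the old gen fills up between CMS cycles, e.g. by sampling it with the standard jstat tool (the pid placeholder is the region server's JVM):

    # print heap occupancy percentages and GC counts every 10 seconds
    jstat -gcutil <regionserver-pid> 10000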
At this point I feel somewhat lost as to how to debug the problem. I'm not sure what to do next to figure out what is going on. Any suggestions as to what to look for, or how to track down where the memory is being used? I can generate heap dumps via jmap (although it effectively kills the region server), but I don't really know what to look for to see where the memory is going. I also have jmx set up on each region server and can connect to it that way.
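For what it's worth, the way I'd planned to grab and browse a dump (accepting that it pauses that region server) is roughly the standard jmap/jhat route; the output path is just an example:

    # binary dump of live objects, then browse it in jhat (or Eclipse MAT)
    jmap -dump:live,format=b,file=/tmp/rs-heap.hprof <regionserver-pid>
    jhat -J-Xmx4g /tmp/rs-heap.hprof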
Thanks,
~Jeff
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected] <mailto:[email protected]>
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected] <mailto:[email protected]>
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]