Good morning St.Ack,
The schema consists of one table with one column family, holding five
columns: one string (<20 chars) and four doubles (rather minimal,
really).
The load test runs in 24 concurrent mappers, each run writing 500k rows,
with 2000 runs in total (one billion rows).
WAL is turned on.
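
For a sense of scale, here is a back-of-the-envelope sketch of the numbers above. The 20-byte string size is an assumed upper bound (one byte per character), and the function names are purely illustrative. It counts only the cell values, not row keys, family and qualifier names, or timestamps.

```python
import math

ROWS_PER_RUN = 500_000      # from the load test description
TOTAL_RUNS = 2_000          # from the load test description
CONCURRENT_MAPPERS = 24     # from the load test description

STRING_BYTES = 20           # assumption: upper bound for the "<20 chars" string
DOUBLE_BYTES = 8            # size of an IEEE 754 double
DOUBLES_PER_ROW = 4


def total_rows(rows_per_run=ROWS_PER_RUN, runs=TOTAL_RUNS):
    """Total number of rows written by the whole test."""
    return rows_per_run * runs


def value_bytes_per_row(string_bytes=STRING_BYTES, doubles=DOUBLES_PER_ROW):
    """Raw value payload of one row, excluding all key overhead."""
    return string_bytes + doubles * DOUBLE_BYTES


def total_value_gib():
    """Raw value payload of the whole test in GiB."""
    return total_rows() * value_bytes_per_row() / 2**30


def waves(runs=TOTAL_RUNS, mappers=CONCURRENT_MAPPERS):
    """Number of rounds needed when `mappers` runs execute at a time."""
    return math.ceil(runs / mappers)


if __name__ == "__main__":
    print("rows:", total_rows())
    print("value bytes per row:", value_bytes_per_row())
    print("total value payload (GiB): %.1f" % total_value_gib())
    print("waves of mappers:", waves())
```

Since HBase stores the row key, family, qualifier and timestamp with every cell, the volume actually written will be noticeably larger than this value-only figure.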
And yes, it took down two region servers and the processes were
eventually gone. From the logs however it looked as if the region
servers still tried to continue for a while after the first OOM.
They didn't get restarted and I had the impression the HMaster didn't
respond to web requests either (but I shut it down quickly to restart
the whole cluster - so not sure about that).
My hbase-env.sh is out-of-the-box except for the heap settings. So the
GC config is
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
Is that not aggressive enough?
hbase-site.xml is also standard, except for the cluster config (i.e. the
ZooKeeper quorum config etc.).
Just noticed that there is a gc log. I will look into that as well.
Currently retrying with 2G heap.
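
As a rough sanity check on heap sizes, the sketch below splits a region server heap into its two big configurable consumers. It assumes `hbase.regionserver.global.memstore.upperLimit` is 0.4 and `hfile.block.cache.size` is 0.2; I believe those are the 0.90 defaults, but they should be checked against hbase-default.xml. The function name is made up for illustration.

```python
def heap_budget_mb(heap_mb, memstore_fraction=0.4, block_cache_fraction=0.2):
    """Split a heap into memstore, block cache, and everything else.

    The fractions are assumptions standing in for
    hbase.regionserver.global.memstore.upperLimit and
    hfile.block.cache.size; verify them for your version.
    """
    memstore = heap_mb * memstore_fraction
    block_cache = heap_mb * block_cache_fraction
    return {
        "memstore_mb": memstore,
        "block_cache_mb": block_cache,
        # What is left for RPC buffers, compactions, indexes, etc.
        "remaining_mb": heap_mb - memstore - block_cache,
    }


if __name__ == "__main__":
    for heap in (1536, 2048, 8192):
        budget = heap_budget_mb(heap)
        print(heap, {k: round(v, 1) for k, v in budget.items()})
```

Under these assumptions, a 1.5G heap leaves only about 600 MB for everything that is not memstore or block cache, which is the part an OOM under heavy write load would most likely hit.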
Thanks,
Henning
On 07/11/2011 06:24 PM, Stack wrote:
> On Mon, Jul 11, 2011 at 1:04 AM, Henning Blohm<[email protected]> wrote:
>> I am running HBase 0.90.3 (just upgraded for testing). It is configured for
>> 1.5G heap, which seemed to be a good setting for HBase 0.20.6. When running
>> a stress test that would write into three HBase data nodes from 24 processes
>> with the goal of inserting one billion simple rows, I get OOMs at two of
>> three region servers after about 75% of the work is done.
>
> What's your schema? What's the size of your cells? 0.90 is different
> to 0.20. 1.5G is little memory but HBase should just work w/ 1G or
> more of heap.
>
>> Here is the first OOM:
>> 2011-07-09 23:34:40,988 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Applied 924, skipped 1105, firstSequenceidInLog=162957072,
>> maxSequenceidInLog=163841413
>
> This looks like you are crashing regionservers. Is that so? What's
> your current GC config?
>
>> Now:
>> 1. Is there any way to configure some stable heap size? Where is the leak?
>> This is really frustrating (it took a while to figure out 1.5G was "somehow
>> good" for 0.20.6)
>
> Start big. Give it 8Gs? See how it does then.
> How many handlers are you running with?
>
>> 2. Wouldn't it make sense to let the region server die at the first OOM and
>> have it restarted quickly rather than letting it go on in some likely broken
>> state after the OOM until it eventually dies anyway?
>
> Don't we do this currently? The only time this does not happen is when
> the OOME happens out at extremities in RPC which we do not directly
> control (We should fix that). It catches OOME and then tries to keep
> going. Otherwise, if OOME, we'll release the reservoir of memory that
> we've been holding back so we can shut ourselves down.
> St.Ack
--
*Henning Blohm*
*ZFabrik Software KG*
T: +49/62278399955
F: +49/62278399956
M: +49/1781891820
Bunsenstrasse 1
69190 Walldorf
[email protected] <mailto:[email protected]>
Linkedin <http://de.linkedin.com/pub/henning-blohm/0/7b5/628>
www.zfabrik.de <http://www.zfabrik.de>
www.z2-environment.eu <http://www.z2-environment.eu>