I checked gc-hbase.log; the longest GC pause is 9.3477570 seconds, and I am
already using -XX:+UseConcMarkSweepGC. So it is not a GC issue.

The 2-minute gap is most probably because the system is too busy with disk IO. I have noticed that, a lot of the time when the system is busy, even a simple "ls" command takes several
seconds.


Since this is not the initial import, I can't really use HFileOutputFormat.

In production I do have stronger machines: I give 6G of memory to the regionserver and don't colocate it
with the tasktracker. But I am restricted from using it for testing.

The reason I started this extensive testing is to understand why the production regionservers occasionally crash, as I have seen before. Based on my testing, I realized that hbase does have a problem with sustained high-rate data inserts. Following the earlier reasoning, if we keep the data rate high, the production machines will run out of disk IO capacity sooner or later,
before the cluster runs out of memory, CPU, or storage space.
Unless we constantly monitor disk IO usage and keep adding regionservers, this problem will happen. The disk IO requirement goes up linearly with the number of regions if we make the keys uniformly
distributed.
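
To be concrete, by sustained high-rate inserts I just mean a plain client write
loop with a rate cap, roughly like the sketch below (0.20-era HTable API; the
table name, column family, row keys, buffer size and rate cap are only
illustrative, not the exact values I ran with):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ThrottledInsert {
  public static void main(String[] args) throws IOException, InterruptedException {
    HTable table = new HTable(new HBaseConfiguration(), "test_table");
    table.setAutoFlush(false);                  // buffer puts on the client side
    table.setWriteBufferSize(2 * 1024 * 1024);  // 2 MB write buffer (illustrative)

    final int rowsPerSecond = 1000;             // illustrative rate cap
    long i = 0;
    while (true) {                              // run until the test is stopped
      long start = System.currentTimeMillis();
      for (int n = 0; n < rowsPerSecond; n++, i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));   // row key generation is illustrative
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        table.put(put);
      }
      table.flushCommits();                     // push the buffered batch to the regionservers
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed < 1000) {
        Thread.sleep(1000 - elapsed);           // throttle to roughly rowsPerSecond
      }
    }
  }
}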

An alternative solution seems to be to design the key so that the number of actively inserted regions stays limited instead of writes being spread across all regions. For a lot of applications this seems hard to do, especially when
we also need to consider optimizing for data reading.
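
One concrete shape that key design could take (just a sketch; the bucket count
is an arbitrary assumption): lead the row key with a small bucket number
followed by a time component, so writes only ever land in the "tail" region of
each bucket and at most a handful of regions are actively written. The cost is
exactly the read-side problem above: a lookup by record id has to check every
bucket.

import org.apache.hadoop.hbase.util.Bytes;

public class BucketedKey {
  // Assumption: 4 buckets caps the number of actively written regions at ~4.
  private static final int NUM_BUCKETS = 4;

  // Key layout: [1-byte bucket][8-byte timestamp][record id].
  // Within a bucket, keys are ever-increasing, so only the last region of
  // each bucket receives new writes.
  public static byte[] makeKey(long timestampMs, String recordId) {
    byte bucket = (byte) ((recordId.hashCode() & 0x7fffffff) % NUM_BUCKETS);
    return Bytes.add(new byte[] { bucket },
                     Bytes.toBytes(timestampMs),
                     Bytes.toBytes(recordId));
  }
}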

Jimmy.


--------------------------------------------------
From: "Jean-Daniel Cryans" <[email protected]>
Sent: Thursday, July 15, 2010 10:20 AM
To: <[email protected]>; <[email protected]>
Subject: Re: regionserver crash under heavy load

About your new crash:

2010-07-15 04:09:03,248 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=3.3544312MB (3517376), Free=405.42056MB (425114272),
Max=408.775MB (428631648), Counts: Blocks=0, Access=1985476, Hit=0,
Miss=1985476, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss
Ratio=100.0%, Evicted/Run=NaN
2010-07-15 04:11:32,791 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=3.3544312MB (3517376), Free=405.42056MB (425114272),
Max=408.775MB (428631648), Counts: Blocks=0, Access=1985759, Hit=0,
Miss=1985759, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss
Ratio=100.0%, Evicted/Run=NaN
2010-07-15 04:11:32,965 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes:
Total=3.3544312MB (3517376), Free=405.42056MB (425114272),
Max=408.775MB (428631648), Counts: Blocks=0, Access=1985768, Hit=0,
Miss=1985768, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss
Ratio=100.0%, Evicted/Run=NaN
2010-07-15 04:11:33,330 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 176029ms for
sessionid 0x129c87a7f980045, closing socket connection and attempting
reconnect
2010-07-15 04:11:33,330 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 175969ms for
sessionid 0x129c87a7f980052, closing socket connection and attempting
reconnect

Your machine wasn't able to talk to the server for almost 2 minutes,
and we can see that it stopped logging for approx that time. Again,
I'd like to point out that this is a JVM/hardware issue.

My other comments inline.

J-D

On Thu, Jul 15, 2010 at 9:55 AM, Jinsong Hu <[email protected]> wrote:
I tested hbase with client throttling and I noticed that in the
beginning the process is limited by CPU speed. After running for several
hours, the speed becomes limited by disk. Then I realized that
there is a serious flaw in the hbase design:

In the beginning the number of regions is very small, so hbase only
does compactions for that limited number of regions. But as insertion
continues we get more and more regions, and the clients continue to insert
records into them, preferably uniformly. At that point lots of
regions need to be compacted. Most clusters have a relatively stable
number of physical machines, so more regions lead to more compactions, which
means a given physical machine needs to support more compaction work as
time goes on. Compaction mostly involves disk IO, so sooner or later
the disk IO capacity will be exhausted. At that moment the machine is no longer responsive, timeouts eventually happen, and the hbase regionserver
crashes.

To me it looks like a GC pause with swapping and/or CPU starvation,
which leads to the region server shutting itself down.


This seems to be a design flaw in hbase. In my testing, even a sustained
insertion rate of 1000 can crash the cluster. Is there any solution for this?

Well first the values you are inserting are a lot bigger than what
HBase is optimized for out of the box.

WRT your IO exhaustion description, what you really described is the
limit of scalability for a single region server. If HBase had a
single-server architecture, I'd say that you're right, but this is not the
case. In the Bigtable paper, they say they usually support 100 tablets
(regions) per tablet server (our RegionServer). When you grow beyond
that, you need to add more machines! This is the scalability model:
HBase can grow to hundreds or thousands of servers without any fancy
third-party software.

WRT solutions:

1- if this is your initial data import, I'd recommend you use
HFileOutputFormat since it will be much faster and you don't have to
wait for splits/compactions to happen (there's a rough job-setup
sketch after this list). See
http://hbase.apache.org/docs/r0.20.5/api/org/apache/hadoop/hbase/mapreduce/HFileOutputFormat.html

2- if you are trying to see how HBase handles your expected "normal"
load, then I would recommend getting production-grade machines.

3- if you wish to keep the same hardware, you need to give more
resources to HBase. Running maps and reduces on the same machine,
when you have only 4 cores, easily leads to resource contention. We
recommend running HBase with at least 4GB; you have only 2GB. It also
needs cores: at StumbleUpon, for example, we use 2x i7, which amounts to
16 CPUs when counting hyperthreading, so we run 8 maps and 6 reducers
per machine, leaving at least 2 cores for HBase (it's a database, it needs
processing power). Even when we did the initial import of our data, we
didn't have any GC issues. This is also the setup for our production
cluster, although we (almost) don't run MR jobs on it since they are
usually IO-heavy.
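
The job-setup sketch referred to in point 1, assuming the Hadoop 0.20
mapreduce API (the tab-separated input, the "cf"/"q" column names and the
omitted partitioning/load step are assumptions, and the exact classes involved
vary by HBase version):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkImport {

  // Turns each input line "rowkey<TAB>value" into a KeyValue for HFileOutputFormat.
  static class ImportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) return;             // skip malformed lines
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
                                 Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "bulk-import");
    job.setJarByClass(BulkImport.class);
    job.setMapperClass(ImportMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setOutputFormatClass(HFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // The HFiles written under args[1] must be in total key order, so the
    // partitioning needs care when using more than one reducer.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

After the job finishes, the HFiles under the output directory still have to be
loaded into the table with the bulk-load tooling that matches your HBase
version.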


Jimmy.

