On Sun, Mar 25, 2012 at 1:23 AM, Lior Schachter <[email protected]> wrote:
> Hi all,
> We use hbase 0.9.2. We recently started to experience region servers

You mean 0.90.2? Or 0.92.0?

> crashed under heavy load (2-3 different servers crashes each load).
> Seems like missing block in HDFS causes a full GC and regions are being
> closed.

Not at all.

So first we can see that your region server was doing Full GCs back to
back because it's not able to collect anything (look how the numbers
aren't decreasing). This eventually leads to a session timeout in
ZooKeeper, and at some point your region server woke up and saw that it
had lost control of its HDFS files because of IO fencing (I know those
exceptions look bad, but it's "normal").

Now to fix this, there are multiple avenues to explore. The non-stop
GCing means that the memory is completely full, but it grew slowly
enough that it got a concurrent mode failure before an
OutOfMemoryError. Here's what's using memory in HBase:

 - MemStores
 - Block Cache
 - Client requests
 - Background tasks like flushing and compacting

You can control how much memory the first two use; have you
tweaked that?
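
To make that concrete, here is a rough Python sketch of the heap
split. The 8 GB heap is a made-up figure, and the 0.4 / 0.2 fractions
are what I believe the defaults were around 0.90 for
hbase.regionserver.global.memstore.upperLimit and
hfile.block.cache.size; check the hbase-default.xml shipped with your
version before relying on them.

```python
def heap_budget(heap_bytes, memstore_fraction, block_cache_fraction):
    """Split a region server heap into its two configurable pools and the rest.

    memstore_fraction    ~ hbase.regionserver.global.memstore.upperLimit
    block_cache_fraction ~ hfile.block.cache.size
    Whatever is left has to hold client requests, flushes and compactions.
    """
    memstores = int(heap_bytes * memstore_fraction)
    block_cache = int(heap_bytes * block_cache_fraction)
    return {
        "memstores": memstores,
        "block_cache": block_cache,
        "everything_else": heap_bytes - memstores - block_cache,
    }


if __name__ == "__main__":
    HEAP = 8 * 1024 ** 3  # hypothetical 8 GB heap, not taken from the thread
    # 0.4 and 0.2 are assumed defaults, see the note above.
    for name, size in heap_budget(HEAP, 0.4, 0.2).items():
        print(f"{name}: {size / 1024 ** 2:.0f} MB")
```

The "everything_else" number is the part you can't cap directly in
0.90 and 0.92, which is why it matters under heavy insert load.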

And you say this is happening under heavy load (I'm guessing
inserts?), so it might be that the client requests carry payloads that
the region server can't possibly hold all at the same time. In 0.90
and 0.92 the amount of memory dedicated to this is unbounded;
starting in 0.94 this will come in handy:

https://issues.apache.org/jira/browse/HBASE-5190

At the same time I'm pretty sure you have some blocking and/or
splitting going on and the client requests are just sitting there in
the region server memory (grep -i your logs for "block" to confirm)
while this happens.
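
If you'd rather get a per-file count than scroll through grep output,
a small Python sketch along these lines does the same case-insensitive
match. The function name is mine, and the log paths are whatever you
pass on the command line.

```python
import sys


def count_matches(lines, needle="block"):
    """Count lines containing `needle`, ignoring case (like grep -ic)."""
    needle = needle.lower()
    return sum(1 for line in lines if needle in line.lower())


if __name__ == "__main__":
    # Usage (hypothetical file name): python count_block.py regionserver.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            print(f"{path}: {count_matches(log)} lines mention 'block'")
```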

At this point there are 3 things you can do:

 - Use bulk loading instead of brute forcing it into HBase.
 - Tune HBase in order to block as little as possible if you still like
brute forcing. This means setting bigger regions, bigger memstores,
and enabling the compactions to block at higher than 7 files.
 - Tune the IPC queue capacity in order to have fewer calls sitting in
the RS memory. If you're on 0.90.2 the "easy" way to do it is to have
fewer handlers by setting hbase.regionserver.handler.count to less than
10. On 0.92 you can tweak it more directly with
ipc.server.max.queue.size where the default is 100 (or 1000 in 0.90.2
FWIW). This was all discussed in
https://issues.apache.org/jira/browse/HBASE-3813
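
To see why the queue capacity matters, here is a back-of-the-envelope
sketch in Python. The 2 MB average payload is purely hypothetical, and
the "100 calls per handler" rule for 0.90.2 is my assumption (it is
consistent with the 1000 figure at 10 handlers, but verify it against
your version).

```python
MB = 1024 * 1024


def queue_capacity_090(handler_count, calls_per_handler=100):
    """Assumed 0.90.2 behaviour: the call queue scales with the handler count."""
    return handler_count * calls_per_handler


def queued_call_memory_bytes(queue_capacity, avg_payload_bytes):
    """Upper bound on heap held by calls waiting in the IPC queue."""
    return queue_capacity * avg_payload_bytes


if __name__ == "__main__":
    avg_payload = 2 * MB  # hypothetical average size of one batched insert call
    scenarios = [
        ("0.90.2, 10 handlers", queue_capacity_090(10)),
        ("0.90.2, 5 handlers", queue_capacity_090(5)),
        ("0.92, ipc.server.max.queue.size=100", 100),
    ]
    for label, capacity in scenarios:
        worst_case = queued_call_memory_bytes(capacity, avg_payload)
        print(f"{label}: up to {worst_case // MB} MB sitting in the queue")
```

With payloads of that size a full 1000-call queue alone could outgrow
a typical heap, while 100 calls stays manageable; plug in your own
payload sizes to see where you stand.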


Hope this helps in some ways,

J-D
