Region server request throughput drops to zero

James Baldassari Sun, 03 Oct 2010 20:06:09 -0700

Hi,

We've been having a strange problem with our HBase cluster recently (0.20.5
+ HBASE-2599 + IHBase-0.20.5).  Everything works fine for a while: mostly
gets at 5-10k/sec, plus an hourly bulk insert (using HTable puts) that can
spike the total throughput up to 15-50k ops/sec.  At some point, though, the
cluster gets into a state where the request throughput (gets and puts)
drops to zero across 5 of our 6 region servers.  Restarting the whole
cluster is the only way to fix the problem, but it gets back into that bad
state again after 4-12 hours.


Nothing in the region server or master logs indicates any errors except
occasional DFS client timeouts.  The logs look exactly like they do during
normal operation, even with debug logging on.  I have GC logging on as well,
and there are no long GC pauses (the region servers have 11G of heap).  When
the request rate drops, the load on the region servers is low, there is
little to no I/O wait, and there are no messages in the region server logs
indicating that the region servers are busy with anything like a
compaction.  It seems as if the region servers simply stop processing
requests.  We have three different client applications sending
requests to HBase, and they all drop to zero requests/second at the same
time, so I don't think it's an issue on the client side.  There are no
errors in our client logs either.

Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W

Any ideas what could be causing the cluster to freeze up?  I guess my next
plan is to get thread dumps on the region servers and the clients the next
time it happens.  Is there somewhere else I should look other than the
master and region server logs?
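
For reference, here is a minimal sketch of the kind of information a thread
dump provides. It is illustrative only: in practice the usual route is to
run jstack against the region server process, and the class name
ThreadDumpSketch is hypothetical. The sketch uses the standard
java.lang.management API to print every live thread in the current JVM with
its state and full stack, and reports whether the JVM detects any
deadlocked threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of HBase.
public class ThreadDumpSketch {

    // Builds a text dump of every live thread in the *current* JVM.
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();

        // false, false = skip lock/synchronizer details, which keeps the
        // call supported on every JVM; stack traces are still included.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }

        // findDeadlockedThreads() returns null when no deadlock is found.
        long[] deadlocked = mx.findDeadlockedThreads();
        out.append("deadlocked threads: ")
           .append(deadlocked == null ? 0 : deadlocked.length)
           .append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Many handler threads parked in the same frame (for example, all waiting on
one lock or one socket read) would be the thing to look for in such a dump.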

Thanks,
James
