Hi, We've been having a strange problem with our HBase cluster recently (0.20.5 + HBASE-2599 + IHBase-0.20.5). Everything will be working fine, doing mostly gets at 5-10k/sec and an hourly bulk insert (using HTable puts) that can spike the total throughput up to 15-50k ops/sec, but at some point the cluster gets into this state where the request throughput (gets and puts) drops to zero across 5 of our 6 region servers. Restarting the whole cluster is the only way to fix the problem, but it gets back into that bad state again after 4-12 hours.
Nothing in the region server or master logs indicates any errors except occasional DFS client timeouts. The logs look exactly like they do during normal operation, even with debug logging on. I have GC logging on as well, and there are no long GC pauses (the region servers have 11G of heap). When the request rate drops the load is low on the region servers, there is little to no I/O wait, and there are no messages in the region server logs indicating that the region servers are busy doing anything like a compaction. It seems like the region servers just decided to stop processing requests. We have three different client applications sending requests to HBase, and they all drop to zero requests/second at the same time, so I don't think it's an issue on the client side. There are no errors in our client logs either. Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W Any ideas what could be causing the cluster to freeze up? I guess my next plan is to get thread dumps on the region servers and the clients the next time it happens. Is there somewhere else I should look other than the master and region server logs? Thanks, James
