Hello,

Lately we have been having issues with Region Servers crashing in our cluster. This happens in particular while running Map/Reduce jobs over HBase tables, but also spontaneously when the cluster is seemingly idle.
Restarting the Region Servers, or even restarting HBase entirely along with the HDFS and Map/Reduce services, does not fix the problem, and jobs fail on the next attempt with "Region not served" exceptions. It is not always the same nodes that crash.

The logs from the minutes leading up to a crash contain many "File does not exist /hbase/<table_name>/..." error messages, which then give way to "Too many open files" messages. Finally, there are a few "Failed to renew lease for DFSClient" messages, followed by several "FATAL" messages about HLog not being able to sync and, immediately afterwards, a terminal "ABORTING region server". You can find an extract of a Region Server log here: http://pastebin.com/G39LQyQT

Running "hbase hbck" reveals inconsistencies in some tables, but attempting a repair with "hbase hbck -repair" stalls because some regions are in transition; see here: http://pastebin.com/JAbcQ4cc

The setup consists of 30 machines with 26 GB RAM each. The services are managed using CDH4, so the HBase version is 0.92.x. We did not change any of the default configuration settings; however, table scans are done with sensible scan/batch/filter settings. Data intake is about 100 GB/day, which is added at a time when no Map/Reduce jobs are running. Tables have between 100 * 10^6 and 2 * 10^9 rows, with an average of 10 KVs per row, about 1 KB each. Very few rows exceed 10^6 KVs.

What can we do to fix these issues? Are they symptomatic of a misconfigured setup or of some critical threshold being reached? The cluster used to be stable.

Thank you,
/David
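P.S. In case it helps anyone compare against their own logs, here is a rough sketch of how the message sequence described above could be tallied per minute. It only matches the substrings quoted in this post. The timestamp format is an assumption (a log4j-style "YYYY-MM-DD HH:MM:SS,mmm" prefix), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Substrings quoted in the post; the exact wording in a given log may differ.
SIGNATURES = [
    "File does not exist",
    "Too many open files",
    "Failed to renew lease",
    "FATAL",
    "ABORTING region server",
]

# Assumption: each log entry starts with "YYYY-MM-DD HH:MM:SS,mmm".
# Only the part up to the minute is captured.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def tally_by_minute(lines):
    """Return {minute: Counter(signature -> count)} for matching log lines."""
    buckets = {}
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # skip continuation lines such as stack traces
        minute = match.group(1)
        for signature in SIGNATURES:
            if signature in line:
                buckets.setdefault(minute, Counter())[signature] += 1
    return buckets


def print_report(buckets):
    """Print one line per minute, in chronological order."""
    for minute in sorted(buckets):
        counts = ", ".join(f"{name}: {n}" for name, n in buckets[minute].items())
        print(f"{minute}  {counts}")
```

To use it, open a Region Server log file and pass the file object to tally_by_minute, then hand the result to print_report. If the description above holds, the "File does not exist" counts should show up first, followed by "Too many open files", and finally the FATAL and ABORTING lines.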
