I've been very happy with HBase, and am very much looking forward to more stable releases in the future. Today I had another one of those unfortunate crashes that seem to occur every few days, and I need some help understanding how to speed up the recovery, which is taking longer than usual. I'm running on CDH3.

Right now, messages are being written to the master log file at a rate of hundreds per second. They start with:

"2010-08-31 23:55:15,886 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_PROCESS_OPEN:"

and end with "a of b", where a counts up to b each second. I seem to remember seeing b count down during a previous recovery. So, for example, I might get 200 messages in one second with lines ending in "1 of 200", "2 of 200", ... "200 of 200". Then the next second b might be 199, so the lines would end in "1 of 199", "2 of 199", ... "199 of 199". Unfortunately, b has now stayed constant at 148 for half an hour.

The only work HBase appears to be doing is printing hundreds of log messages. It reports that all the region servers are online. DFS is healthy with proper replication. The machines are under low load, with no other jobs or services running on them. The region servers have either 4 or 6 GB allocated to them, and CPU utilization appears to be under 15% on every machine.

Not all of the region servers are showing progress. On at least one of them I can see messages of the form:

"2010-09-01 00:14:35,209 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:"

These are appearing VERY SLOWLY, and the other region servers appear to be completely idle while this is going on.

I really need some help getting things back up and running. I have people who are waiting to get work done. How can I convince HBase to just start up and stop fooling around? (Is the INFO log level intended to be so verbose?)

Thank you for your help,
Matthew
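
P.S. For anyone who wants to check whether b is really stuck rather than eyeballing the log, here is a minimal Python sketch. It assumes only the two parts of the master log line quoted above: the timestamp prefix with "Processing MSG_REPORT_PROCESS_OPEN:" and the trailing "a of b". Everything in between is treated as opaque text. The function names (b_per_second, is_stalled) are made up for illustration and are not part of HBase.

```python
import re
from collections import OrderedDict

# Matches the quoted master log prefix and the trailing "a of b".
# The text between the message type and "a of b" is not interpreted.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"org\.apache\.hadoop\.hbase\.master\.ServerManager: "
    r"Processing MSG_REPORT_PROCESS_OPEN:.*?(\d+) of (\d+)\s*$"
)


def b_per_second(lines):
    """Return an ordered mapping from each timestamp (to the second)
    to the largest b seen in that second. Non-matching lines are skipped."""
    totals = OrderedDict()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        second, b = m.group(1), int(m.group(3))
        totals[second] = max(b, totals.get(second, 0))
    return totals


def is_stalled(totals):
    """True if b was observed in more than one second and never changed."""
    values = list(totals.values())
    return len(values) > 1 and len(set(values)) == 1
```

To use it, pass an open log file to b_per_second, for example `b_per_second(open(path))` where path is wherever your master log lives, and print the resulting mapping. If b is the same for every second over a long stretch, as it is for me at 148, is_stalled returns True.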
