I've been very happy with HBase, and am very much looking forward to more 
stable releases in the future. Today, I had another one of those unfortunate 
crashes that seem to occur every few days, and I need some help understanding 
how I can speed up the recovery, which is taking longer than usual. I'm running 
on CDH3.

Right now, I'm getting log messages printed out at a rate of hundreds per 
second in the master log file.

They start with: "2010-08-31 23:55:15,886 INFO 
org.apache.hadoop.hbase.master.ServerManager: Processing 
MSG_REPORT_PROCESS_OPEN:"

And end with: "a of b"

Here a counts up to b within each second. I seem to remember that b used to 
count down during a previous recovery. So, for example, I might get 200 
messages in one second with lines ending in "1 of 200", "2 of 200", ... "200 of 
200". Then the next second b might be 199, so the lines would end in "1 of 
199", "2 of 199", ... "199 of 199".

Unfortunately, right now, b has stayed constant at 148 for the past half hour. 
The only work HBase appears to be doing is printing hundreds of log messages.
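
In case it helps anyone check the same thing, here is a rough sketch of how b 
can be tracked over time from the master log. It assumes each message sits on 
a single line, starts with the timestamp shown above, and ends in "a of b"; the 
function name and the regular expression are my own illustration, not anything 
that ships with HBase.

```python
import re
from collections import OrderedDict

# Assumed line shape: "<date> <time>,<millis> ... MSG_REPORT_PROCESS_OPEN: ... a of b"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"MSG_REPORT_PROCESS_OPEN:.*?(\d+) of (\d+)\s*$"
)


def totals_per_second(lines):
    """Map each timestamp (truncated to the second) to the largest b seen."""
    totals = OrderedDict()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not one of the "a of b" messages
        second = match.group(1)
        b = int(match.group(3))
        totals[second] = max(b, totals.get(second, 0))
    return totals


# Example use (the log path is hypothetical):
# with open("hbase-master.log") as log:
#     for second, b in totals_per_second(log).items():
#         print(second, b)
```

If b is going down from one second to the next, recovery is making progress; 
a flat value, like the 148 I am seeing, suggests it is stuck.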

HBase reports that all the region servers are online. DFS is healthy with 
proper replication. The machines are under low load, with no other jobs or 
services running on them. Region servers have either 4 or 6 GB allocated to 
them. All of the machines appear to have CPU utilization under 15%.

Not all of the region servers are showing progress... on at least one of them I 
can see messages of the form: 

"2010-09-01 00:14:35,209 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:"

These are appearing VERY SLOWLY, and other region servers appear to be 
completely idle while this is going on.  

I really need some help to get things back up and running.  I have people who 
are waiting to get work done.  

How can I convince HBase to just start up and stop fooling around? (Is the INFO 
log level intended to be so verbose?)

Thank you for your help, 

Matthew

