One thing you can do is to kill -9 the master process, then restart it with bin/hbase-daemon.sh start master.
This will clear the master state, and it will inspect the cluster when restarting to figure out where things are. If that doesn't work you can also restart HBase completely.

Are the region servers even able to open the regions? Any exceptions? Can you show us some logs perhaps? Do use a service like pastebin or put them on some web server.

It's verbose in this case because there are a lot of regions to assign, and for debugging purposes (like right now) we need to be able to trace the movements of every region.

J-D

On Tue, Aug 31, 2010 at 5:19 PM, Matthew LeMieux <[email protected]> wrote:
> I've been very happy with HBase, and am very much looking forward to more
> stable releases in the future. Today, I had another one of those
> unfortunate crashes that seems to occur every few days and need some help
> understanding how I can speed up the recovery, which is taking longer than
> usual. I'm running on CDH3.
>
> Right now, I'm getting log messages printed out at a rate of 100's / second
> in the master log file.
>
> They start with: "2010-08-31 23:55:15,886 INFO
> org.apache.hadoop.hbase.master.ServerManager: Processing
> MSG_REPORT_PROCESS_OPEN:"
>
> And end with: "a of b"
>
> Where a counts up to b each second. I seem to remember that I used to see b
> count down during a previous recover. So, for example, I might get 200
> messages one second with lines ending in "1 of 200", "2 of 200", ... "200 of
> 200". Then the next second b might be 199, so the lines would end in "1 of
> 199", "2 of 199", .... "199 of 199".
>
> Unfortunately, right now, b seems to stay constant at 148 for a half hour.
> The only work HBase appears to be doing is printing hundreds of log messages.
>
> It says all the region servers are online. DFS is healthy with proper
> replication. The machines are under low load, having no other jobs or
> services running on them. Region servers have either 4 or 6 GB allocated to
> them. The machines appear to all have CPU utilization of under 15%.
>
> Not all of the region servers are showing progress... on at least one of them
> I can see messages of the form:
>
> "2010-09-01 00:14:35,209 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:"
>
> These are appearing VERY SLOWLY, and other region servers appear to be
> completely idle while this is going on.
>
> I really need some help to get things back up and running. I have people who
> are waiting to get work done.
>
> How can I convince HBase to just startup and stop fooling around? (Is the
> INFO log level intended to be so verbose?)
>
> Thank you for your help,
>
> Matthew
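
To tell whether assignment is really stalled or just slow, it may help to pull the "a of b" counters out of the master log and see whether b ever changes. Below is a minimal sketch, assuming each relevant line starts with a timestamp, contains MSG_REPORT_PROCESS_OPEN, and ends with "<a> of <b>" as described in the quoted message. The function names are hypothetical, and the exact text between the message name and the counters is not known from this thread, so the pattern deliberately ignores it.

```python
import re

# Assumption: lines look like
#   "<date> <time>,<ms> INFO ... MSG_REPORT_PROCESS_OPEN: ... <a> of <b>"
# Only the timestamp, the message name, and the trailing counters are used.
PROGRESS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ "
    r".*MSG_REPORT_PROCESS_OPEN.* (\d+) of (\d+)\s*$"
)


def batch_sizes(lines):
    """Return sorted (second, b) pairs: the last b reported in each second."""
    seen = {}
    for line in lines:
        match = PROGRESS.match(line)
        if match:
            seen[match.group(1)] = int(match.group(3))
    return sorted(seen.items())


def is_stuck(sizes):
    """True if b was observed in at least two seconds and never changed."""
    return len(sizes) >= 2 and len({b for _, b in sizes}) == 1
```

Feeding it the lines of the master log (for example via open(path), where the path depends on your install) gives one (second, b) pair per second. If is_stuck returns True over a long window, b is not shrinking, which matches the "constant at 148" symptom; a slowly decreasing b would instead suggest the regions are opening, just slowly.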
