First, I saw:

2010-09-21 11:30:05,122 DEBUG org.apache.hadoop.hbase.master.RegionServerOperationQueue: Put ProcessServerShutdown of 10.103.2.5,60020,1285042335711 back on queue
2010-09-21 11:30:05,122 DEBUG org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: ProcessServerShutdown of 10.103.2.5,60020,1285042335711
2010-09-21 11:30:05,122 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of server 10.103.2.5,60020,1285042335711: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0

This sequence repeated rapidly for 20 minutes or so. Then a bunch of regions got unassigned:

2010-09-21 12:00:07,782 DEBUG org.apache.hadoop.hbase.master.RegionManager: Unassigning 66 regions from 10.103.2.3,60020,1285042333293
2010-09-21 12:00:07,782 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region img816,img2103r.jpg,1285003791610.1592893332
2010-09-21 12:00:07,782 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region img534,92166039.jpg,1284949117852.1009352950
2010-09-21 12:00:07,782 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region img36,abcwu.jpg,1285001278990.272235177

Restarting the master did not help. Ultimately, what brought the cluster back up was a full shutdown of the regionservers and masters, and then bringing everything back up.

Any ideas what might have happened here?

We are running:
HBase Version 0.89.20100726, r979826
Hadoop Version 0.20.2+320, r9b72d268a0b590b4fd7d13aca17c1c453f8bc957
Regions On FS 5057
3 zookeepers and 13 regionservers.

-Jack
