Hi Ted,

Thanks for the reply.
Here is the master log from when it shut down this region server. The region server never started again.

-----------------------
2015-09-03 15:29:33,738 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [e3ecmrhdp24.mercury.corp,60020,1441316616368]
2015-09-03 15:29:33,848 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for e3ecmrhdp24.mercury.corp,60020,1441316616368 before assignment; region count=0
2015-09-03 15:29:33,851 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [e3ecmrhdp24.mercury.corp,60020,1441316616368]
2015-09-03 15:29:33,853 INFO org.apache.hadoop.hbase.master.SplitLogManager: hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting is empty dir, no logs to split
2015-09-03 15:29:33,853 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 0 logs in [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting] for [e3ecmrhdp24.mercury.corp,60020,1441316616368]
2015-09-03 15:29:33,855 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 0 bytes in 0 log files in [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting] in 2ms
2015-09-03 15:29:33,856 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 region(s) that e3ecmrhdp24.mercury.corp,60020,1441316616368 was carrying (and 0 regions(s) that were opening on this server)
2015-09-03 15:29:33,856 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of e3ecmrhdp24.mercury.corp,60020,1441316616368
2015-09-03 15:29:36,399 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=417.02 KB, freeSize=395.54 MB, max=395.95 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=269245, evicted=0, evictedPerRun=0.0

2015-09-08 17:25 GMT-07:00 Ted Yu <[email protected]>:

> Can you pastebin master log snippet with regard to the dead server ?
>
> > On Sep 8, 2015, at 5:16 PM, 伍照坤 <[email protected]> wrote:
> >
> > Hi, Guys
> >
> > I encountered a serious problem in Production, the HMaster schedule lots of balance jobs to a dead node.
> >
> > Environment: hbase-1.0.0-cdh.4.0, hadoop-2.6.0-cdh5.4.0, zookeeper-3.4.5-cdh5.4.0
> >
> > the region server e3ecmrhdp24 is dead from 09/03/2015.
> > I checked the Zookeeper /hbase/rs, and HBase WebUI, this server is dead node.
> >
> > But the hmaster still schedule lots of balance jobs to e3ecmrhdp24 after this region server is dead.
> >
> > the balance job runs every 5 minutes, which schedules 60000+ region balance on this dead region server.
> >
> > #1 the balancer on hmaster will schedule region to balance to e3ecmrhdp24.
> > #2 after 1 seconds, the hmaster assign this region to another region server
> >
> > I guess
> > #1 e3ecmrhdp24 is still a live node in HMaster memory.
> > #2 the number of regions on e3ecmrhdp24 is less than the balance ratio, so the balancer always schedule region to this dead server.
> >
> > After I restarted the HMaster, this problem is gone.
> >
> > It looks a critical bug in HBase, any hints?
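
In case it helps anyone check their own master log for the same pattern, below is a minimal Python sketch, not anything taken from HBase itself. It only assumes the RegionServerTracker line format shown in the log snippet in this mail. The function names (`expired_servers`, `lines_after_expiry`) are made up for illustration, and since the balancer's own log lines are not shown here, the second function just matches on the host name rather than on any particular balancer message.

```python
import re

# Matches the RegionServerTracker expiration line as it appears in the
# master log snippet in this mail (format assumed from that snippet only).
EXPIRED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.hadoop\.hbase\.zookeeper\.RegionServerTracker: "
    r"RegionServer ephemeral node deleted, processing expiration "
    r"\[(?P<server>[^\]]+)\]"
)


def expired_servers(lines):
    """Return (timestamp, server_name) for every expiration event found."""
    events = []
    for line in lines:
        m = EXPIRED.match(line.strip())
        if m:
            events.append((m.group("ts"), m.group("server")))
    return events


def lines_after_expiry(lines, host):
    """Yield log lines that still mention `host` after its first expiration.

    Relies on the order of lines in the file, not on comparing timestamps.
    """
    expired = False
    for line in lines:
        line = line.rstrip("\n")
        if not expired:
            m = EXPIRED.match(line)
            # Server names look like "host,port,startcode".
            if m and m.group("server").startswith(host + ","):
                expired = True
            continue
        if host in line:
            yield line
```

Usage would be something like `list(lines_after_expiry(open(path), "e3ecmrhdp24.mercury.corp"))`, where `path` is wherever the master log lives on your cluster. Any line that comes back after the expiration event is worth a look when the master appears to keep treating the server as live.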
