Can you pastebin the master log snippet relating to the dead server?
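
If it helps, something like the following would pull out only the lines that mention the server. It is a minimal sketch: the function name is made up for illustration, and the master log path in the usage comment is a placeholder, since the actual location depends on your install.

```python
def lines_mentioning(lines, server):
    """Yield the log lines that contain the given server name."""
    for line in lines:
        if server in line:
            # Drop the trailing newline so the output pastes cleanly.
            yield line.rstrip("\n")


# Usage sketch (the path below is a placeholder, not a real default):
#
#     with open("/path/to/hbase-master.log", errors="replace") as log:
#         for hit in lines_mentioning(log, "e3ecmrhdp24"):
#             print(hit)
```

The lines around the time the server died and around one balancer run would be the most useful ones to paste.
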
> On Sep 8, 2015, at 5:16 PM, 伍照坤 <[email protected]> wrote:
>
> Hi, Guys
>
> I encountered a serious problem in production: the HMaster schedules lots of
> balance jobs to a dead node.
>
> Environment: hbase-1.0.0-cdh.4.0, hadoop-2.6.0-cdh5.4.0,
> zookeeper-3.4.5-cdh5.4.0
>
> The region server e3ecmrhdp24 has been dead since 09/03/2015.
> I checked ZooKeeper /hbase/rs and the HBase Web UI; this server is a dead node.
>
> But the HMaster still schedules lots of balance jobs to e3ecmrhdp24 after this
> region server died.
>
> The balance job runs every 5 minutes and schedules 60000+ region balances
> on this dead region server.
>
> #1 The balancer on the HMaster schedules a region to move to e3ecmrhdp24.
> #2 After 1 second, the HMaster assigns this region to another region server.
>
> I guess
> #1 e3ecmrhdp24 is still a live node in HMaster memory.
> #2 The number of regions on e3ecmrhdp24 is less than the balance ratio, so
> the balancer always schedules regions to this dead server.
>
> After I restarted the HMaster, the problem went away.
>
> It looks like a critical bug in HBase. Any hints?
