This was the first occurrence of balancing onto e3ecmrhdp24:

2015-09-03 18:00:31,137 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=ecitem:IM_ItemBase,69,1440541971138.93a12ec8a63d6954e0432e8b9d7c0922., src=e3ecmrhdp33.mercury.corp,60020,1438626881418, dest=e3ecmrhdp24.mercury.corp,60020,1438626879309

Prior to the above, there was no indication that e3ecmrhdp24 came back to life, because it didn't.

I noticed that DEBUG logging was off. Is it possible to turn on DEBUG logging?

BTW, please redact server names in the logs you upload in the future (e.g. you can call e3ecmrhdp24 X as long as all occurrences of e3ecmrhdp24 are called X but no other server is called X).

Cheers

On Tue, Sep 8, 2015 at 6:14 PM, 伍照坤 <[email protected]> wrote:

> Hi, Ted
>
> Thanks, I attached the log as a tar.gz on Dropbox.
> https://www.dropbox.com/s/czes89w5r3rr1wa/hbase-log.tar.gz?dl=0
>
> The dead server name: e3ecmrhdp24
>
> It looks like after I truncated another table, the master started to
> balance regions to the dead node.
>
> ------------------------------
> 2015-09-03 17:57:28,689 INFO org.apache.hadoop.hbase.master.HMaster: Client=tw79//172.16.31.133 truncate ecitem:IVT_ItemInventory
> -----------------
>
> 2015-09-08 17:47 GMT-07:00 Ted Yu <[email protected]>:
>
> > Can you pastebin more of the master log after 15:29:33,856 w.r.t.
> > e3ecmrhdp24?
> >
> > I wonder how master thought e3ecmrhdp24 became live again.
> >
> > On Tue, Sep 8, 2015 at 5:37 PM, 伍照坤 <[email protected]> wrote:
> >
> > > Hi, Ted
> > >
> > > Thanks for the reply.
> > >
> > > Here is the log from when the master shut down this region server; it
> > > never started again.
> > > -----------------------
> > > 2015-09-03 15:29:33,738 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > > 2015-09-03 15:29:33,848 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for e3ecmrhdp24.mercury.corp,60020,1441316616368 before assignment; region count=0
> > > 2015-09-03 15:29:33,851 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > > 2015-09-03 15:29:33,853 INFO org.apache.hadoop.hbase.master.SplitLogManager: hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting is empty dir, no logs to split
> > > 2015-09-03 15:29:33,853 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 0 logs in [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting] for [e3ecmrhdp24.mercury.corp,60020,1441316616368]
> > > 2015-09-03 15:29:33,855 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 0 bytes in 0 log files in [hdfs://nameservice1/hbase/WALs/e3ecmrhdp24.mercury.corp,60020,1441316616368-splitting] in 2ms
> > > 2015-09-03 15:29:33,856 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 region(s) that e3ecmrhdp24.mercury.corp,60020,1441316616368 was carrying (and 0 regions(s) that were opening on this server)
> > > 2015-09-03 15:29:33,856 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of e3ecmrhdp24.mercury.corp,60020,1441316616368
> > > 2015-09-03 15:29:36,399 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=417.02 KB, freeSize=395.54 MB, max=395.95 MB, blockCount=0, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0,evictions=269245, evicted=0, evictedPerRun=0.0
> > >
> > > 2015-09-08 17:25 GMT-07:00 Ted Yu <[email protected]>:
> > >
> > > > Can you pastebin the master log snippet with regard to the dead
> > > > server?
> > > >
> > > > > On Sep 8, 2015, at 5:16 PM, 伍照坤 <[email protected]> wrote:
> > > > >
> > > > > Hi, Guys
> > > > >
> > > > > I encountered a serious problem in production: the HMaster
> > > > > schedules lots of balance jobs to a dead node.
> > > > >
> > > > > Environment: hbase-1.0.0-cdh5.4.0, hadoop-2.6.0-cdh5.4.0,
> > > > > zookeeper-3.4.5-cdh5.4.0
> > > > >
> > > > > The region server e3ecmrhdp24 has been dead since 09/03/2015.
> > > > > I checked ZooKeeper /hbase/rs and the HBase Web UI; both show
> > > > > this server as a dead node.
> > > > >
> > > > > But the HMaster still schedules lots of balance jobs to
> > > > > e3ecmrhdp24 after this region server died.
> > > > >
> > > > > The balance job runs every 5 minutes, and it has scheduled 60000+
> > > > > region balance operations onto this dead region server.
> > > > >
> > > > > #1 The balancer on the HMaster schedules a region to balance to
> > > > > e3ecmrhdp24.
> > > > > #2 After 1 second, the HMaster assigns this region to another
> > > > > region server.
> > > > >
> > > > > My guess:
> > > > > #1 e3ecmrhdp24 is still a live node in HMaster memory.
> > > > > #2 The number of regions on e3ecmrhdp24 is less than the balance
> > > > > ratio, so the balancer always schedules regions to this dead
> > > > > server.
> > > > >
> > > > > After I restarted the HMaster, the problem was gone.
> > > > >
> > > > > It looks like a critical bug in HBase. Any hints?
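
On the redaction request at the top of this thread (call e3ecmrhdp24 X everywhere, and call no other server X), here is a minimal Python sketch of one way to do a consistent renaming pass over a log before uploading it. The function name redact_hosts, the serverN alias scheme, and the assumption that every host shows up at least once under the .mercury.corp suffix are illustrative choices of mine, not part of HBase or of any existing tool.

```python
import re


def redact_hosts(text, domain=".mercury.corp"):
    """Rename every host found under `domain` to a stable alias.

    Assumption: each host appears at least once with its full domain
    suffix, so it can be discovered from the text itself.
    Returns the redacted text and the host -> alias mapping.
    """
    fqdn = re.compile(r"\b([A-Za-z0-9-]+)" + re.escape(domain) + r"\b")

    # Assign aliases in order of first appearance; a host seen again
    # keeps the alias it already has.
    aliases = {}
    for host in fqdn.findall(text):
        if host not in aliases:
            aliases[host] = "server%d" % (len(aliases) + 1)

    # First replace fully qualified names, which also drops the domain.
    text = fqdn.sub(lambda m: aliases[m.group(1)], text)

    # Then replace bare short names. The word boundaries keep a host
    # such as "e3ecmrhdp2" from matching inside "e3ecmrhdp24".
    for host, alias in aliases.items():
        text = re.sub(r"\b" + re.escape(host) + r"\b", alias, text)

    return text, aliases
```

To use it, read the master log into a string, pass it to redact_hosts, and write the returned text to a new file. Keep the returned mapping private so the aliases can be translated back when reading replies.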
