Michael Stack created HBASE-25142:
-------------------------------------

             Summary: Auto-fix 'Unknown Server'
                 Key: HBASE-25142
                 URL: https://issues.apache.org/jira/browse/HBASE-25142
             Project: HBase
          Issue Type: Improvement
            Reporter: Michael Stack


Addressing reports of 'Unknown Server' has come up in various conversations 
lately. This issue is about fixing instances of 'Unknown Server' automatically 
as part of the tasks undertaken by CatalogJanitor when it runs.

First, though, I would like to settle on a definition for 'Unknown Server' and 
list the ways in which they arise. We need this to work out how to auto-fix safely.

Currently an 'Unknown Server' is a server found in hbase:meta that is not 
online (no recent heartbeat) and that is not mentioned in the dead servers list.
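
As a minimal sketch of the definition above (plain Python sets, not the HBase 
API): the function name and the three input collections are hypothetical 
stand-ins for whatever the Master actually reads from hbase:meta, the online 
server list, and the dead servers list. The server names are made-up examples 
in the host,port,startcode form.

```python
from typing import Iterable, Set


def find_unknown_servers(
    meta_servers: Iterable[str],
    online_servers: Iterable[str],
    dead_servers: Iterable[str],
) -> Set[str]:
    """Return servers referenced in meta that are neither online nor listed as dead.

    Hypothetical helper for illustration only; not an HBase method.
    """
    return set(meta_servers) - set(online_servers) - set(dead_servers)


# Made-up server names: rs1 is online, rs2 is on the dead servers list,
# rs3 is only referenced from meta.
meta = [
    "rs1,16020,1600000000000",
    "rs2,16020,1600000000001",
    "rs3,16020,1600000000002",
]
online = ["rs1,16020,1600000000000"]
dead = ["rs2,16020,1600000000001"]

unknown = find_unknown_servers(meta, online, dead)
```

Here only rs3 comes back as an 'Unknown Server': rs1 is heartbeating and rs2 
is already known to be dead.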

In outline, I'd think CatalogJanitor, when it finds an 'Unknown Server', could 
schedule an expiration of the RS znode in zk (if it exists) and then an SCP. 
Perhaps it waits for 2x or 10x the heartbeat interval just in case (or not). 
The SCP would clean up any references in hbase:meta by reassigning Regions 
assigned to the 'Unknown Server' after replaying any WALs found in hdfs 
attributed to the dead server.
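
The outline above could be sketched roughly as follows. This is a toy model 
in plain Python, not CatalogJanitor code: the class, the expire_znode and 
schedule_scp callbacks, and the heartbeat value are all assumptions made up 
for illustration. It only shows the ordering (wait out a grace period, expire 
the znode, then schedule the SCP).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class UnknownServerFixer:
    """Toy model of one janitor pass; every name here is hypothetical."""

    grace_seconds: float
    expire_znode: Callable[[str], None]   # stand-in for expiring the RS znode in zk
    schedule_scp: Callable[[str], None]   # stand-in for scheduling an SCP
    first_seen: Dict[str, float] = field(default_factory=dict)

    def run(self, unknown: Set[str], now: float) -> List[str]:
        # Forget servers that are no longer reported as unknown.
        for server in list(self.first_seen):
            if server not in unknown:
                del self.first_seen[server]

        fixed = []
        for server in sorted(unknown):
            seen = self.first_seen.setdefault(server, now)
            # Act only once the server has stayed unknown for the grace period.
            if now - seen >= self.grace_seconds:
                self.expire_znode(server)
                self.schedule_scp(server)
                del self.first_seen[server]
                fixed.append(server)
        return fixed


calls = []
heartbeat_interval = 3.0  # assumed value, in seconds
fixer = UnknownServerFixer(
    grace_seconds=10 * heartbeat_interval,  # the "10x" option from the text
    expire_znode=lambda s: calls.append(("expire", s)),
    schedule_scp=lambda s: calls.append(("scp", s)),
)

first = fixer.run({"rs3"}, now=0.0)    # first sighting: nothing scheduled yet
second = fixer.run({"rs3"}, now=30.0)  # still unknown after the grace period
```

Tracking the first sighting per server is one way to get the "just in case" 
wait without blocking the janitor run; dropping the wait is the same as 
setting grace_seconds to 0.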

As to how they arise:

 * A contrived illustration would be a large online cluster that crashes with a 
massive backlog of WAL files – say zk went down for some reason. The replay of 
the WALs looks like it could take a very long time (let's say the cluster was 
badly configured and a bug plus the misconfig made it so each RS was carrying 
hundreds of WALs and there are hundreds of servers). To get the service back 
online, the procedure store and WALs are moved aside (for later replay with 
WALPlayer). The cluster comes up. meta is onlined but refers to server 
instances that are no longer around. We can schedule an SCP per server mentioned 
in the 'HBCK Report' by scraping and scripting hbck2 or, better, CatalogJanitor 
could just do it.

 * HBASE-24286 HMaster won't become healthy after after cloning... describes 
starting a cluster over data that is hfile-content only. In this case the 
original servers used to manufacture the hfile cluster data are long dead yet meta 
still refers to the old servers. They will not make the 'dead servers' list.

Let this issue stew awhile. Meantime, collect how 'Unknown Server' gets created 
and the best way to fix it.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
