Nick Dimiduk created HBASE-24293:
------------------------------------

             Summary: Assignment manager should never give up assigning meta
                 Key: HBASE-24293
                 URL: https://issues.apache.org/jira/browse/HBASE-24293
             Project: HBase
          Issue Type: Bug
          Components: master, Region Assignment
    Affects Versions: 2.3.0
            Reporter: Nick Dimiduk

Not yet sure how we got here, but,

{noformat}
2020-04-29 22:39:16,140 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=308, state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure server= host-a.example.com,16020,1588033841562, splitWal=true, meta=true found a region state=OFFLINE, location=null, table=hbase:meta, region=1588230740 which is no longer on us host-a.example.com,16020,1588033841562, give up assigning...
{noformat}

Assignment manager gives up on this procedure and nothing can progress. Manual intervention is necessary.

From this [conditional block|https://github.com/apache/hbase/blob/1415a82d41a1e125440014a4b23364371b30d065/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L475], it seems the {{regionNode}} location is {{null}}.

{noformat}
// This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
// to us and then try to assign the region to a new RS. And before it has updated the region
// location to the new RS, we may have already called the am.getRegionsOnServer so we will
// consider the region is still on us. And then before we arrive here, the TRSP could have
// updated the region location, or even finished itself, so the region is no longer on us
// any more, we should not try to assign it again. Please see HBASE-23594 for more details.
if (!serverName.equals(regionNode.getRegionLocation())) {
  LOG.info("{} found a region {} which is no longer on us {}, give up assigning...", this,
    regionNode, serverName);
  continue;
}
{noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
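
A minimal sketch of why a null location would land in the "give up assigning" branch of the conditional quoted above. It is an illustration only, not HBase code: the class MetaAssignCheck and the method givesUp are hypothetical, and plain Strings stand in for ServerName. The one assumption carried over is that equals(null) returns false, as the Object.equals contract requires.

```java
public class MetaAssignCheck {

    // Same shape as the quoted check: true means the procedure skips the region.
    // Strings are stand-ins for ServerName; this is not the HBase implementation.
    public static boolean givesUp(String crashedServer, String regionLocation) {
        // equals(null) is false, so a null location always yields true here.
        return !crashedServer.equals(regionLocation);
    }

    public static void main(String[] args) {
        String crashed = "host-a.example.com,16020,1588033841562";

        // Region still recorded on the crashed server: not skipped.
        System.out.println(givesUp(crashed, crashed));

        // Region already moved to another server (made-up name): skipped, as intended.
        System.out.println(givesUp(crashed, "host-b.example.com,16020,1588033900000"));

        // Region has no location at all, as in the log line: also skipped.
        System.out.println(givesUp(crashed, null));
    }
}
```

Under these assumptions the check cannot tell "already moved elsewhere" apart from "not assigned anywhere", which matches the state=OFFLINE, location=null region in the log line being skipped.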