Duo Zhang created HBASE-20634:
---------------------------------

             Summary: Reopen region while server crash can cause the procedure 
to be stuck
                 Key: HBASE-20634
                 URL: https://issues.apache.org/jira/browse/HBASE-20634
             Project: HBase
          Issue Type: Bug
            Reporter: Duo Zhang


Found this when implementing HBASE-20424, where we transition the peer sync 
replication state while there is a server crash.

The problem is that in ServerCrashAssign we do not hold the region lock, so 
after we call handleRIT to clear the existing assign/unassign procedures 
related to this rs, and before we schedule the assign procedures, it is 
possible that we schedule an unassign procedure for a region on the crashed 
rs. This procedure will not receive a ServerCrashException; instead, in 
addToRemoteDispatcher, it will find that it cannot dispatch the remote call, 
and a FailedRemoteDispatchException will be raised. But we do not treat this 
exception the same as ServerCrashException; instead, we will try to expire the 
rs. Obviously the rs has already been marked as expired, so this is almost a 
no-op, and the procedure will be stuck there forever.
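
Below is a minimal, self-contained sketch of the failure path described above. 
The class and member names (RemoteDispatcherSketch, expireServer, the Set used 
in place of the real nodeMap) are hypothetical stand-ins, not the actual HBase 
API.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RemoteDispatcherSketch {
  static class FailedRemoteDispatchException extends Exception {
    FailedRemoteDispatchException(String msg) { super(msg); }
  }

  // nodeMap has one entry per live rs; the entry is removed when the rs dies.
  private final Set<String> nodeMap = ConcurrentHashMap.newKeySet();
  private final Set<String> expiredServers = ConcurrentHashMap.newKeySet();

  // Sketch of addToRemoteDispatcher: if the target rs has no node, the call
  // cannot be dispatched and we fail immediately.
  void addToRemoteDispatcher(String rs) throws FailedRemoteDispatchException {
    if (!nodeMap.contains(rs)) {
      throw new FailedRemoteDispatchException(rs + " is not online");
    }
    // ... otherwise queue the remote call on the node ...
  }

  // Current handling in the unassign path: on a failed dispatch we only try
  // to expire the rs. But the rs is already expired (the crash is what
  // removed its node), so this is almost a no-op and the procedure is never
  // woken up again.
  void unassign(String rs) {
    try {
      addToRemoteDispatcher(rs);
    } catch (FailedRemoteDispatchException e) {
      expireServer(rs); // no-op here; the unassign procedure stays stuck
    }
  }

  void expireServer(String rs) {
    if (expiredServers.add(rs)) {
      // would schedule a ServerCrashProcedure for a newly-dead rs
    }
  }
}
{code}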

A possible way to fix it is to treat FailedRemoteDispatchException the same as 
ServerCrashException, as it is only created in addToRemoteDispatcher, and the 
only reason we cannot dispatch a remote call is that the rs is already dead. 
The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
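
A rough sketch of the proposed handling, reusing the same hypothetical names 
as the sketch above (not the real HBase classes): the failed dispatch is 
converted into the same handling path as ServerCrashException, so the 
procedure is re-driven instead of waiting on an expire call that will never do 
anything.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ProposedFixSketch {
  static class FailedRemoteDispatchException extends Exception {}
  static class ServerCrashException extends Exception {
    ServerCrashException(String msg) { super(msg); }
  }

  // The nodeMap stands in as the guard: the entry for an rs is removed
  // exactly when the rs is declared dead, and a concurrent map/set keeps the
  // check below consistent with that removal.
  private final Set<String> nodeMap = ConcurrentHashMap.newKeySet();

  void addToRemoteDispatcher(String rs) throws FailedRemoteDispatchException {
    if (!nodeMap.contains(rs)) {
      throw new FailedRemoteDispatchException();
    }
    // ... otherwise queue the remote call on the node ...
  }

  // Proposed handling: treat the failed dispatch exactly like a server crash,
  // because the only way addToRemoteDispatcher can fail is that the rs is
  // already dead.
  void unassign(String rs) throws ServerCrashException {
    try {
      addToRemoteDispatcher(rs);
    } catch (FailedRemoteDispatchException e) {
      throw new ServerCrashException(rs + " is already dead, give up the dispatch");
    }
  }
}
{code}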





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)