[ https://issues.apache.org/jira/browse/HBASE-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bharath Vissapragada resolved HBASE-25741.
------------------------------------------
    Resolution: Fixed

> Deadlock during peer cleanup with NoNodeException
> -------------------------------------------------
>
>                 Key: HBASE-25741
>                 URL: https://issues.apache.org/jira/browse/HBASE-25741
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.0
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Major
>              Labels: regression
>             Fix For: 1.7.0
>
>
> We have observed that replication source metrics for a peer still exist on
> some region servers even though the peer has been removed. This happens
> because, when ReplicationSource encounters a NoNodeException, it calls the
> `peerRemoved` workflow, which should eventually terminate the source and
> remove it from the source manager. The problem is that the ReplicationSource
> thread terminates itself, so the removePeer action never completes and the
> source's metrics are left behind forever. The flow is: the replication source
> tries to clean WALs
> [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L801]
> and, on NoNodeException, calls
> [peerRemoved|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L244],
> which terminates the source (itself). The terminated source is left in the
> source manager and its
> [metrics|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L645]
> are never cleared.
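To make the failure mode concrete, here is a minimal Java sketch under stated assumptions: `Source`, `Manager`, `SourceTerminated`, and `metricsLeaked` are hypothetical stand-ins, not HBase classes, and "the source thread terminates itself" is modeled by throwing an exception out of the source's own thread. It shows why ordering matters when `peerRemoved` is invoked from the very thread it is about to stop. The cleanup-first branch is just one way to avoid the leak in this toy model, not necessarily what the HBASE-25741 patch does.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, heavily simplified model of the failure mode described above.
// Class and method names are illustrative only, not HBase's real API.
public class PeerCleanupSketch {

    /** Thrown to model a source thread that stops itself mid-call. */
    static final class SourceTerminated extends RuntimeException {}

    static final class Source implements Runnable {
        final String peerId;
        final Manager manager;
        volatile Thread thread;

        Source(String peerId, Manager manager) {
            this.peerId = peerId;
            this.manager = manager;
        }

        void terminate() {
            if (Thread.currentThread() == thread) {
                // Self-termination: nothing after this call on the current
                // stack (including the caller's cleanup) gets to run.
                throw new SourceTerminated();
            }
            thread.interrupt();
        }

        @Override
        public void run() {
            try {
                // Pretend cleaning old WALs hit NoNodeException, so the source
                // reports the peer as removed from its own thread.
                manager.peerRemoved(peerId);
            } catch (SourceTerminated e) {
                // The source thread exits here.
            }
        }
    }

    static final class Manager {
        final boolean cleanupFirst;
        final Map<String, Source> sources = new ConcurrentHashMap<>();
        final Map<String, Long> metrics = new ConcurrentHashMap<>();

        Manager(boolean cleanupFirst) {
            this.cleanupFirst = cleanupFirst;
        }

        void peerRemoved(String peerId) {
            Source src = sources.get(peerId);
            if (src == null) {
                return;
            }
            if (cleanupFirst) {
                // Remove bookkeeping before stopping the source.
                sources.remove(peerId);
                metrics.remove(peerId);
                src.terminate();
            } else {
                // Problematic ordering: if the caller is the source thread itself,
                // terminate() unwinds it and the two removals never execute.
                src.terminate();
                sources.remove(peerId);
                metrics.remove(peerId);
            }
        }
    }

    /** Returns true if metrics for the peer are left behind after the source exits. */
    static boolean metricsLeaked(boolean cleanupFirst) throws InterruptedException {
        Manager manager = new Manager(cleanupFirst);
        Source src = new Source("peer1", manager);
        manager.sources.put("peer1", src);
        manager.metrics.put("peer1", 0L);
        Thread t = new Thread(src);
        src.thread = t; // set before start() so terminate() can recognize its own thread
        t.start();
        t.join();
        return manager.metrics.containsKey("peer1");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminate first: leaked=" + metricsLeaked(false));
        System.out.println("cleanup first:   leaked=" + metricsLeaked(true));
    }
}
```

Running `main` prints `leaked=true` for the terminate-first ordering and `leaked=false` for the cleanup-first ordering. The general point is that a cleanup routine which may be called from the thread it stops should not put essential bookkeeping after the stop call.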