[ 
https://issues.apache.org/jira/browse/HBASE-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wellington Chevreuil resolved HBASE-27871.
------------------------------------------
    Resolution: Fixed

> Meta replication stuck forever if wal it's still reading gets rolled and 
> deleted
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-27871
>                 URL: https://issues.apache.org/jira/browse/HBASE-27871
>             Project: HBase
>          Issue Type: Bug
>          Components: meta replicas
>    Affects Versions: 2.6.0, 2.4.16, 2.4.17, 2.5.4
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6
>
>
> This affects branch-2 based releases only (in master, HBASE-26416 refactored 
> region replication to not rely on the replication framework anymore).
> Per the original [meta region replicas 
> design|https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit],
>  we use most of the replication framework for communicating changes in the 
> primary replica back to the secondary ones, but we skip storing the queue 
> state in ZK. If region replication crashes, the related replication source 
> thread should be interrupted, so that RegionReplicaReplicationEndpoint sets 
> up a new source from scratch and makes sure the secondary replicas are 
> updated.
>  
> We ran into a situation in one of our customers' clusters where the region 
> replica source lagged far behind (probably because the RSes hosting the 
> secondary replicas were busy and slow to process the region replication 
> entries), so the current WAL got rolled and eventually deleted while the 
> replication source reader was still reading it. In such cases, 
> ReplicationSourceReader only sees an IOException and keeps retrying the read 
> indefinitely; since the file is now gone, it stays stuck there forever. In 
> the particular case of a FileNotFoundException (FNFE), which I believe would 
> only happen for region replication, we should just raise an exception and let 
> RegionReplicaReplicationEndpoint handle it by resetting the region 
> replication source.
>  
>  
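
The retry behaviour described above can be illustrated with a minimal, standalone Java sketch. It is not the HBase code and not the actual patch: WalReader, SourceResetRequiredException, and readWithRetry are hypothetical names invented for illustration, and the retry cap exists only so the sketch terminates (the issue describes the real reader as retrying indefinitely). The only point it shows is the distinction the description proposes: a FileNotFoundException is treated as fatal for the current source, while other IOExceptions are retried.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

final class WalReadRetrySketch {

  /** Hypothetical stand-in for "read the next entry from the current WAL". */
  interface WalReader {
    String readNext() throws IOException;
  }

  /** Hypothetical signal that the replication source must be rebuilt. */
  static final class SourceResetRequiredException extends RuntimeException {
    SourceResetRequiredException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Retries transient IOExceptions, but gives up immediately when the WAL
   * file no longer exists, because no amount of retrying can bring it back.
   * maxRetries is an assumption of this sketch, added so it always terminates.
   */
  static String readWithRetry(WalReader reader, int maxRetries) {
    int failures = 0;
    while (true) {
      try {
        return reader.readNext();
      } catch (FileNotFoundException e) {
        // The WAL was rolled and deleted: surface the problem so the caller
        // can discard this source and start a new one.
        throw new SourceResetRequiredException("WAL no longer exists", e);
      } catch (IOException e) {
        // Possibly transient: retry (a real reader would back off here).
        failures++;
        if (failures > maxRetries) {
          throw new IllegalStateException("too many read failures", e);
        }
      }
    }
  }
}
```

Note that the FileNotFoundException clause has to come before the IOException clause, since FileNotFoundException is a subclass of IOException; a single catch of IOException would fold the "file is gone" case into the generic retry path, which is the stuck state the issue describes.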



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
