Wei-Chiu Chuang created HDFS-11022:
--------------------------------------

             Summary: DataNode unable to remove corrupt block replica due to 
race condition
                 Key: HDFS-11022
                 URL: https://issues.apache.org/jira/browse/HDFS-11022
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, namenode
    Affects Versions: 2.6.0
         Environment: CDH5.7.0
            Reporter: Wei-Chiu Chuang
            Priority: Critical



Scenario:
# A client reads a replica blk_A_x from a data node and detects corruption.
# In the meantime, the replica is appended to, updating its generation stamp 
from x to y.
# The client tells NN to mark the replica blk_A_x corrupt.
# NN tells the data node to (1) delete replica blk_A_x and (2) replicate the 
newer replica blk_A_y from another datanode. Due to block placement policy, 
blk_A_y is replicated to the same node. (It's a small cluster.)
# DN is unable to receive the newer replica blk_A_y, because the replica 
already exists.
# DN is also unable to delete replica blk_A_x because blk_A_x no longer 
exists (the on-disk replica is now blk_A_y).
# The replica on the DN is not part of data pipeline, so it becomes stale.
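
The deadlock in the steps above can be illustrated with a minimal sketch. This 
is not HDFS code: the class ReplicaRaceSketch, its methods, and the generation 
stamp values are hypothetical, and the sketch assumes (based only on the 
behavior reported here) that a delete must match both block ID and generation 
stamp, while a receive is refused whenever a replica with the same block ID 
already exists.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a DataNode's replica map; not HDFS code.
class ReplicaRaceSketch {
    // Block ID -> generation stamp of the replica stored on this DataNode.
    private final Map<String, Long> replicas = new HashMap<>();

    void store(String blockId, long genStamp) {
        replicas.put(blockId, genStamp);
    }

    // An append bumps the generation stamp of the existing replica.
    void append(String blockId, long newGenStamp) {
        if (!replicas.containsKey(blockId)) {
            throw new IllegalStateException("no replica for " + blockId);
        }
        replicas.put(blockId, newGenStamp);
    }

    // Assumption: delete succeeds only if block ID and generation stamp match.
    boolean delete(String blockId, long genStamp) {
        Long stored = replicas.get(blockId);
        if (stored == null || stored != genStamp) {
            return false;
        }
        replicas.remove(blockId);
        return true;
    }

    // Assumption: receive is refused if any replica of the block ID exists.
    boolean receive(String blockId, long genStamp) {
        if (replicas.containsKey(blockId)) {
            return false;
        }
        replicas.put(blockId, genStamp);
        return true;
    }

    public static void main(String[] args) {
        ReplicaRaceSketch dn = new ReplicaRaceSketch();
        long x = 1001L; // generation stamp seen by the reading client
        long y = 1002L; // generation stamp after the append
        dn.store("blk_A", x);
        dn.append("blk_A", y);
        // NN's commands arrive after the append has already happened.
        System.out.println("delete blk_A_x: " + dn.delete("blk_A", x));   // false
        System.out.println("receive blk_A_y: " + dn.receive("blk_A", y)); // false
    }
}
```

Under these assumptions both commands fail, so the replica stays on the 
DataNode and keeps blocking later transfers of the same block.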

If another replica becomes corrupt and NameNode wants to replicate a healthy 
replica to this DataNode, it can't, because a stale replica exists. Because 
this is a small cluster, soon enough (within about an hour) no DataNode is 
able to receive a healthy replica.

This cluster also suffers from HDFS-11019, so even though DataNode later 
detected the data corruption, it was unable to report it to NameNode.

Note that we are still investigating the root cause of the corruption.


