Wei-Chiu Chuang created HDFS-11019:
--------------------------------------

             Summary: Inconsistent number of corrupt replicas if a corrupt 
replica is reported multiple times
                 Key: HDFS-11019
                 URL: https://issues.apache.org/jira/browse/HDFS-11019
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
         Environment: CDH5.7.2 
            Reporter: Wei-Chiu Chuang


While investigating a block corruption issue, I found the following warning 
message in the namenode log:

{noformat}
(a client reports a block replica is corrupt)
2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1073803461 added as corrupt on 
10.0.0.63:50010 by /10.0.0.62  because client machine reported it
2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK* invalidateBlock: 
blk_1073803461_74513(stored=blk_1073803461_74553) on 10.0.0.63:50010
2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK* InvalidateBlocks: add 
blk_1073803461_74513 to 10.0.0.63:50010

(another client reports a block replica is corrupt)
2016-10-12 10:07:37,728 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1073803461 added as corrupt on 
10.0.0.63:50010 by /10.0.0.64  because client machine reported it
2016-10-12 10:07:37,728 INFO BlockStateChange: BLOCK* invalidateBlock: 
blk_1073803461_74513(stored=blk_1073803461_74553) on 10.0.0.63:50010

(ReplicationMonitor thread kicks in to invalidate the replica and add a new one)
2016-10-12 10:07:37,888 INFO BlockStateChange: BLOCK* ask 10.0.0.56:50010 to 
replicate blk_1073803461_74553 to datanode(s) 10.0.0.63:50010
2016-10-12 10:07:37,888 INFO BlockStateChange: BLOCK* BlockManager: ask 
10.0.0.63:50010 to delete [blk_1073803461_74513]

(the two maps are inconsistent)
2016-10-12 10:08:00,335 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number 
of corrupt replicas for blk_1073803461_74553 blockMap has 0 but corrupt 
replicas map has 1
{noformat}

It seems that when a corrupt block replica is reported twice, blocksMap and 
corruptReplicasMap become inconsistent.

Looking at the log, I suspect the bug is in {{BlockManager#removeStoredBlock}}. 
When a corrupt replica is reported, BlockManager removes the block from 
blocksMap. If the block has already been removed (that is, the corrupt replica 
is reported a second time), the method returns early; otherwise (that is, the 
corrupt replica is reported for the first time), it also removes the block from 
corruptReplicasMap (the block is added to corruptReplicasMap in 
{{BlockManager#markBlockAsCorrupt}}). Therefore, after the second corruption 
report, the corrupt replica has been removed from blocksMap, but the entry in 
corruptReplicasMap is never removed.
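To illustrate, here is a minimal model of the two maps and the suspected early-return flow. This is a hypothetical sketch, not the actual HDFS classes: the map shapes, the {{reportCorrupt}} helper, and the cleanup logic are all simplifications of what {{markBlockAsCorrupt}} and {{removeStoredBlock}} do.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model (NOT the real HDFS code) of the two NameNode maps in this
// bug: blocksMap tracks which datanodes store a block; corruptReplicasMap
// tracks which datanodes hold a corrupt replica of it.
public class CorruptReplicaModel {
    final Map<String, Set<String>> blocksMap = new HashMap<>();
    final Map<String, Set<String>> corruptReplicasMap = new HashMap<>();

    // Mirrors the suspected flow: markBlockAsCorrupt adds the replica to
    // corruptReplicasMap, then removeStoredBlock is invoked.
    void reportCorrupt(String block, String datanode) {
        corruptReplicasMap.computeIfAbsent(block, b -> new HashSet<>())
            .add(datanode);
        removeStoredBlock(block, datanode);
    }

    // Models BlockManager#removeStoredBlock: if the replica was already
    // removed from blocksMap (a duplicate report), it returns early and
    // never cleans up corruptReplicasMap, leaving the maps inconsistent.
    void removeStoredBlock(String block, String datanode) {
        Set<String> replicas = blocksMap.get(block);
        if (replicas == null || !replicas.remove(datanode)) {
            return; // duplicate report: corruptReplicasMap entry survives
        }
        Set<String> corrupt = corruptReplicasMap.get(block);
        if (corrupt != null) {
            corrupt.remove(datanode);
            if (corrupt.isEmpty()) {
                corruptReplicasMap.remove(block);
            }
        }
    }

    public static void main(String[] args) {
        CorruptReplicaModel bm = new CorruptReplicaModel();
        bm.blocksMap.computeIfAbsent("blk_1073803461", b -> new HashSet<>())
            .add("10.0.0.63:50010");

        // First report: both maps are cleaned up consistently.
        bm.reportCorrupt("blk_1073803461", "10.0.0.63:50010");
        // Second report of the same replica: early return skips the cleanup.
        bm.reportCorrupt("blk_1073803461", "10.0.0.63:50010");

        // blocksMap now has 0 replicas but corruptReplicasMap has 1 -- the
        // "Inconsistent number of corrupt replicas" warning in the log.
        System.out.println("blocksMap replicas: "
            + bm.blocksMap.get("blk_1073803461").size());
        System.out.println("corruptReplicasMap replicas: "
            + bm.corruptReplicasMap.get("blk_1073803461").size());
    }
}
```

Under this model, a fix could be to remove the replica from corruptReplicasMap before (or regardless of) the early return in {{removeStoredBlock}}.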

I can't tell what the impact of this inconsistency is, but I think it's worth 
fixing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
