[ 
https://issues.apache.org/jira/browse/HDFS-7453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-7453.
---------------------------------
    Resolution: Duplicate

As per prior comments, I am resolving this as a duplicate of HDFS-7208.

> Namenode does not recognize block is missing on a datanode
> ----------------------------------------------------------
>
>                 Key: HDFS-7453
>                 URL: https://issues.apache.org/jira/browse/HDFS-7453
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Bugaychenko
>            Priority: Critical
>
> We suffered severe data loss because the namenode did not recognize that 
> blocks it believes exist are actually missing on the datanodes. The scenario 
> is roughly as follows:
> # A disk fails on one of the datanodes
> # The datanode is forcibly decommissioned and shut down
> # The disk is replaced and the datanode is brought back
> # fsck shows everything is fine
> # Repeat 1-4 for a few weeks
> # Restart the namenode
> # It suddenly sees tens of thousands of under-replicated blocks and hundreds 
> of missing blocks
> During the next disk failure we analysed the situation a bit more and found a 
> particular block on a particular datanode that is missing: there is no file 
> for the block, and if we try to read it, we get:
> {code}
> java.io.IOException: Got error for OP_READ_BLOCK, self=/XXX:33817, 
> remote=XXX/X.X.X.X:50010, for file 
> XXX/X.X.X.X:50010:BP-879324367-YYY-1404837025894:1083356878, for pool 
> BP-879324367-YYY-1404837025894 block 1083356878_9644290
> {code}
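To confirm that a replica file is genuinely absent from the datanode's local disks, one can scan the configured data directories for the block file. A minimal sketch, assuming the data directories and block id from this report; `find_block_file` is a hypothetical helper, not an HDFS API:

```python
import os

def find_block_file(data_dirs, block_id):
    """Walk datanode data directories looking for a replica file blk_<id>."""
    target = "blk_%d" % block_id
    for data_dir in data_dirs:
        for dirpath, _subdirs, filenames in os.walk(data_dir):
            if target in filenames:
                return os.path.join(dirpath, target)
    return None  # no replica file for this block on this datanode

# Example with the data dirs and block id from this report (assumed paths).
path = find_block_file(
    ["/mnt/hadoop/0/dfs/data", "/mnt/hadoop/1/dfs/data", "/mnt/hadoop/2/dfs/data"],
    1083356878,
)
```

A `None` result while the namenode still lists the datanode as a replica holder matches the situation described here.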
> We restarted the datanode, and in the log we can see that it scanned all the 
> directories and sent the report to the namenode:
> {code}
> 2014-11-27 17:06:34,174  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] 
> FsDatasetImpl - Adding block pool BP-879324367-YYY-1404837025894
> 2014-11-27 17:06:34,175  INFO [Thread-41] FsDatasetImpl - Scanning block pool 
> BP-879324367-YYY-1404837025894 on volume /mnt/hadoop/0/dfs/data/current...
> 2014-11-27 17:06:34,176  INFO [Thread-43] FsDatasetImpl - Scanning block pool 
> BP-879324367-YYY-1404837025894 on volume /mnt/hadoop/2/dfs/data/current...
> 2014-11-27 17:06:34,176  INFO [Thread-42] FsDatasetImpl - Scanning block pool 
> BP-879324367-YYY-1404837025894 on volume /mnt/hadoop/1/dfs/data/current...
> 2014-11-27 17:06:34,279  INFO [Thread-42] FsDatasetImpl - Cached dfsUsed 
> found for 
> /mnt/hadoop/1/dfs/data/current/BP-879324367-YYY-1404837025894/current: 
> 62677866794
> 2014-11-27 17:06:34,282  INFO [Thread-42] FsDatasetImpl - Time taken to scan 
> block pool BP-879324367-YYY-1404837025894 on /mnt/hadoop/1/dfs/data/current: 
> 105ms
> 2014-11-27 17:06:34,744  INFO [Thread-41] FsDatasetImpl - Cached dfsUsed 
> found for 
> /mnt/hadoop/0/dfs/data/current/BP-879324367-YYY-1404837025894/current: 
> 2465590681232
> 2014-11-27 17:06:34,744  INFO [Thread-41] FsDatasetImpl - Time taken to scan 
> block pool BP-879324367-YYY-1404837025894 on /mnt/hadoop/0/dfs/data/current: 
> 568ms
> 2014-11-27 17:06:34,856  INFO [Thread-43] FsDatasetImpl - Cached dfsUsed 
> found for 
> /mnt/hadoop/2/dfs/data/current/BP-879324367-YYY-1404837025894/current: 
> 2475580099468
> 2014-11-27 17:06:34,857  INFO [Thread-43] FsDatasetImpl - Time taken to scan 
> block pool BP-879324367-YYY-1404837025894 on /mnt/hadoop/2/dfs/data/current: 
> 680ms
> 2014-11-27 17:06:34,857  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] 
> 2014-11-27 17:06:34,858  INFO [Thread-44] FsDatasetImpl - Adding replicas to 
> map for block pool BP-879324367-YYY-1404837025894 on volume 
> /mnt/hadoop/0/dfs/data/current...
> 2014-11-27 17:06:34,890  INFO [Thread-46] FsDatasetImpl - Adding replicas to 
> map for block pool BP-879324367-YYY-1404837025894 on volume 
> /mnt/hadoop/2/dfs/data/current...
> 2014-11-27 17:06:34,890  INFO [Thread-45] FsDatasetImpl - Adding replicas to 
> map for block pool BP-879324367-YYY-1404837025894 on volume 
> /mnt/hadoop/1/dfs/data/current...
> 2014-11-27 17:06:34,961  INFO [Thread-45] FsDatasetImpl - Time to add 
> replicas to map for block pool BP-879324367-YYY-1404837025894 on volume 
> /mnt/hadoop/1/dfs/data/current: 70ms
> 2014-11-27 17:06:36,083  INFO [Thread-44] FsDatasetImpl - Time to add 
> replicas to map for block pool BP-879324367-YYY-1404837025894 on volume 
> /mnt/hadoop/0/dfs/data/current: 1193ms
> 2014-11-27 17:06:36,162  INFO [Thread-46] FsDatasetImpl - Time to add 
> replicas to map for block pool BP-879324367-YYY-1404837025894 on volume 
> /mnt/hadoop/2/dfs/data/current: 1271ms
> 2014-11-27 17:06:36,162  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] 
> FsDatasetImpl - Total time to add all replicas to map: 1304ms
> 2014-11-27 17:06:36,167  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> Block pool BP-879324367-YYY-1404837025894 (Datanode Uuid null) service to 
> /YYY:8020 beginning handshake with NN
> 2014-11-27 17:06:36,167  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] DataNode - 
> Block pool BP-879324367-YYY-1404837025894 (Datanode Uuid null) service to 
> /XXX:8020 beginning handshake with NN
> 2014-11-27 17:06:36,185  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> Block pool Block pool BP-879324367-YYY-1404837025894 (Datanode Uuid null) 
> service to /YYY:8020 successfully registered with NN
> 2014-11-27 17:06:36,185  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] DataNode - 
> Block pool Block pool BP-879324367-YYY-1404837025894 (Datanode Uuid null) 
> service to /XXX:8020 successfully registered with NN
> 2014-11-27 17:06:36,186  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> For namenode /YYY:8020 using DELETEREPORT_INTERVAL of 300000 msec  
> BLOCKREPORT_INTERVAL of 21600000msec CACHEREPORT_INTERVAL of 10000msec 
> Initial delay: 0msec; heartBeatInterval=3000
> 2014-11-27 17:06:36,186  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] DataNode - 
> For namenode /XXX:8020 using DELETEREPORT_INTERVAL of 300000 msec  
> BLOCKREPORT_INTERVAL of 21600000msec CACHEREPORT_INTERVAL of 10000msec 
> Initial delay: 0msec; heartBeatInterval=3000
> 2014-11-27 17:06:36,264  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> Namenode Block pool BP-879324367-YYY-1404837025894 (Datanode Uuid 
> 91e8deeb-7030-4e7b-af14-a3f35cf65bd2) service to /YYY:8020 trying to claim 
> ACTIVE state with txid=114828921
> 2014-11-27 17:06:36,265  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> Acknowledging ACTIVE Namenode Block pool BP-879324367-YYY-1404837025894 
> (Datanode Uuid 91e8deeb-7030-4e7b-af14-a3f35cf65bd2) service to /YYY:8020
> 2014-11-27 17:06:36,511  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] DataNode - 
> Sent 1 blockreports 49708 blocks total. Took 72 msec to generate and 174 
> msecs for RPC and NN processing.  Got back commands none
> 2014-11-27 17:06:36,511  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> Sent 1 blockreports 49708 blocks total. Took 76 msec to generate and 170 
> msecs for RPC and NN processing.  Got back commands 
> org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@464c42e6
> 2014-11-27 17:06:36,512  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /YYY:8020] DataNode - 
> Got finalize command for block pool BP-879324367-YYY-1404837025894
> 2014-11-27 17:06:36,519  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] GSet - 
> Computing capacity for map BlockMap
> 2014-11-27 17:06:36,523  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] GSet - 
> capacity      = 2^20 = 1048576 entries
> 2014-11-27 17:06:36,520  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] GSet - VM 
> type       = 64-bit
> 2014-11-27 17:06:36,522  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] GSet - 0.5% 
> max memory 1.9 GB = 9.9 MB
> 2014-11-27 17:06:36,524  INFO [DataNode: 
> [[[DISK]file:/mnt/hadoop/0/dfs/data/, [DISK]file:/mnt/hadoop/1/dfs/data/, 
> [DISK]file:/mnt/hadoop/2/dfs/data/]]  heartbeating to /XXX:8020] 
> BlockPoolSliceScanner - Periodic Block Verification Scanner initialized with 
> interval 504 hours for block pool BP-879324367-YYY-1404837025894
> {code}
> But the block is still reported with two replicas, one of which is assigned 
> to a datanode that does not have it:
> {code}
> <block_info>
> <block_id>1083356878</block_id>
> <block_name>blk_1083356878</block_name>
> <file>
> <local_name>commons-lang-2.6.jar</local_name>
> <local_directory>/user/user/analysis</local_directory>
> <user_name>user</user_name>
> <group_name>supergroup</group_name>
> <is_directory>false</is_directory>
> <access_time>1417094143429</access_time>
> <is_under_construction>false</is_under_construction>
> <ds_quota>-1</ds_quota>
> <permission_status>user:supergroup:rw-r--r--</permission_status>
> <replication>2</replication>
> <disk_space_consumed>568440</disk_space_consumed>
> <preferred_block_size>134217728</preferred_block_size>
> </file>
> <replicas>
> <replica>
> <host_name>bad_node</host_name>
> <is_corrupt>false</is_corrupt>
> </replica>
> <replica>
> <host_name>good_node</host_name>
> <is_corrupt>false</is_corrupt>
> </replica>
> </replicas>
> </block_info>
> {code}
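The replica list above can be cross-checked programmatically by parsing such a block_info document and listing the hosts the namenode still believes hold the block. A small sketch using only the standard library, assuming the XML shape matches the snippet above:

```python
import xml.etree.ElementTree as ET

def replica_hosts(block_info_xml):
    """Return the host names the namenode reports as holding the block."""
    root = ET.fromstring(block_info_xml)
    # Each <replica> element carries a <host_name> child.
    return [r.findtext("host_name") for r in root.iter("replica")]
```

For the document above this would return both `bad_node` and `good_node`, even though only `good_node` actually has the replica on disk.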
> fsck does not change the situation; only a namenode restart forces it to 
> recognize that block replica as missing. As a result, the namenode must be 
> restarted after each disk failure, which is not very practical...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
