Please see my inline comments to your queries. I hope I have answered all your 
questions.


Regards
Brahma Reddy Battula

From: Hariharan [mailto:hariharan...@gmail.com]
Sent: 15 November 2016 18:55
To: user@hadoop.apache.org
Subject: HDFS - Corrupt replicas preventing decommissioning?

Hello folks,
I'm running Apache Hadoop 2.6.0 and I'm seeing a weird problem where I keep 
seeing corrupt replicas. Example:
2016-11-15 06:42:38,104 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073747320_231160{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.185:50010 10.0.8.148:50010 10.0.8.149:50010, Current Datanode: 10.0.8.185:50010, Is current datanode decommissioning: true
But I can't figure out which file this block belongs to - hadoop fsck / -files 
-blocks -locations | grep blk_1073747320_231160 returns nothing.
>> It looks like the files are in an open state. You can run fsck with the 
>> -openforwrite option, which will also list the open files.
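>> For example, something along these lines should locate the owning file once 
>> open files are included (untested sketch; the block ID is taken from your 
>> log, and the path should be adjusted to your cluster):
>>
>>   hdfs fsck / -openforwrite -files -blocks -locations | grep blk_1073747320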
So I'm unable to delete the file, and my concern is that this seems to be 
blocking decommissioning of my datanode (going on for ~18 hours now): looking 
at the code in BlockManager.java, the DN is not marked as decommissioned while 
it still holds blocks with no live replicas.
My questions are:
1. What causes corrupt replicas, and how can I avoid them? I seem to be seeing 
these frequently (examples from prior runs below):
>> Since the files are in an open state, there is a chance the blocks end up in 
>> a corrupt state, because the block-received report might not have been sent 
>> to the NameNode. So before starting a decommission, ensure that the files 
>> are closed and check the under-replicated block count.
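>> For instance, a quick sketch of those pre-decommission checks (paths are 
>> illustrative):
>>
>>   hdfs fsck / -files -openforwrite | grep OPENFORWRITE   # any files still open for write?
>>   hdfs fsck / | grep "Under-replicated blocks"           # under-replication summary from fsck
>>   hdfs dfsadmin -report | grep "Under replicated"        # same count as reported by dfsadmin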

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1074063633_2846521{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-7b8e7b76-6066-43fb-8340-d93f7ab9c6ea:NORMAL:10.0.8.75:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 4, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.75:50010 10.0.8.156:50010 10.0.8.188:50010 10.0.8.34:50010 10.0.8.74:50010, Current Datanode: 10.0.8.75:50010, Is current datanode decommissioning: true
hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 3, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.153:50010, Is current datanode decommissioning: true
hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 3, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.7:50010, Is current datanode decommissioning: true
2. Is this possibly a known issue (JIRA) that has been fixed in recent 
versions? (I realize I'm running a very old version.)
>> The relevant JIRA IDs depend on the exact root cause of the corruption; we 
>> would need to check all of your logs to say.
3. Anything I can do to "force" decommissioning of such nodes (apart from 
forcefully terminating them)?
>> As of now there is no "forceful" decommission. But you can delete the files 
>> that have corrupt blocks using "hdfs fsck <filePath> -delete".
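>> For example, a sketch of that workflow (the file path is illustrative):
>>
>>   hdfs fsck / -list-corruptfileblocks        # list files with corrupt/missing blocks
>>   hdfs fsck /path/to/affected/file -delete   # delete the affected file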
Thanks,
Hari


