Please see my inline comments (marked with >>) on your queries below. I hope this answers all of your questions.
Regards,
Brahma Reddy Battula

From: Hariharan [mailto:hariharan...@gmail.com]
Sent: 15 November 2016 18:55
To: user@hadoop.apache.org
Subject: HDFS - Corrupt replicas preventing decommissioning?

Hello folks,

I'm running Apache Hadoop 2.6.0 and I'm seeing a weird problem where I keep seeing corrupt replicas. Example:

2016-11-15 06:42:38,104 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073747320_231160{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-11d5d492-a608-4bc0-9a04-048b8127bb32:NORMAL:10.0.8.185:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 2, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.185:50010 10.0.8.148:50010 10.0.8.149:50010, Current Datanode: 10.0.8.185:50010, Is current datanode decommissioning: true

But I can't figure out which file this block belongs to - "hadoop fsck / -files -blocks -locations | grep blk_1073747320_231160" returns nothing.

>> The file is still in the open state, so a plain fsck does not show it. Run fsck with the -openforwrite option, which lists open files as well.

So I'm unable to delete the file, and my concern is that this seems to be blocking decommissioning of my datanode (going on for ~18 hours now): looking at the code in BlockManager.java, the DN is not marked as decommissioned while it still holds blocks with no live replicas.

My questions are:

1. What causes corrupt replicas and how can I avoid them? I seem to be seeing these frequently (examples from prior runs below):

>> Because the files are in the open state, their blocks can end up in a corrupt state, since the datanode might not yet have sent the "block received" notification to the Namenode. Before going for decommission, ensure that the files are closed and check the under-replicated block count.
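The -openforwrite suggestion above can be combined with the other fsck flags to map the stuck block back to a file path. A minimal sketch, using the block ID from the log above (run it against your own Namenode; the grep patterns are illustrative):

```shell
# List files open for write, with their blocks and locations
# (a plain "hdfs fsck /" skips open files entirely), then look
# for the block that is blocking decommissioning.
hdfs fsck / -openforwrite -files -blocks -locations | grep -B 1 blk_1073747320

# Before decommissioning, check the overall health summary, which
# reports under-replicated, corrupt, and missing block counts.
hdfs fsck / | grep -iE 'under-replicated|corrupt|missing'
```

The `-B 1` on the first grep is just a convenience so the file path printed on the line before the block entry is shown as well.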
hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1074063633_2846521{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-7b8e7b76-6066-43fb-8340-d93f7ab9c6ea:NORMAL:10.0.8.75:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 4, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.75:50010 10.0.8.156:50010 10.0.8.188:50010 10.0.8.34:50010 10.0.8.74:50010, Current Datanode: 10.0.8.75:50010, Is current datanode decommissioning: true

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 3, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.153:50010, Is current datanode decommissioning: true

hadoop-hdfs-namenode-ip-10-0-8-199.log.9:2016-11-13 23:54:57,513 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Block: blk_1073975974_2185091{blockUCState=COMMITTED, primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[[DISK]DS-b9b8b191-f8c8-49b0-b4c1-b2a9ce6b9ee8:NORMAL:10.0.8.153:50010|RBW]]}, Expected Replicas: 2, live replicas: 0, corrupt replicas: 3, decommissioned replicas: 1, excess replicas: 0, Is Open File: true, Datanodes having this block: 10.0.8.153:50010 10.0.8.74:50010 10.0.8.7:50010 10.0.8.198:50010, Current Datanode: 10.0.8.7:50010, Is current datanode decommissioning: true

2. Is this possibly a JIRA that's fixed in recent versions (I realize I'm running a very old version)?

>> We could point you to the relevant JIRA IDs only after identifying the exact root cause of the corruption; that would require going through all of your logs.

3. Is there anything I can do to "force" decommissioning of such nodes (apart from forcefully terminating them)?

>> As of now there is no "forceful" decommission. But you can delete the corrupt blocks with "hdfs fsck <filePath> -delete".

Thanks,
Hari
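Putting the fsck-based cleanup together, a sketch of the last step (the path below is hypothetical; substitute the file that fsck -openforwrite reported as owning the corrupt block, and note that -delete removes the corrupted file itself, not just the bad replicas):

```shell
# Delete the corrupted file so its dead replicas stop pinning
# the decommissioning datanode. /path/to/open/file is a placeholder.
hdfs fsck /path/to/open/file -delete

# Then watch the decommission status of each datanode as reported
# by the Namenode ("Decommission Status : Decommissioned" when done).
hdfs dfsadmin -report | grep -E 'Name:|Decommission Status'
```

Once no blocks on the node have zero live replicas elsewhere, the check in BlockManager.java mentioned above should allow the node to finish decommissioning.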