Hello,

We're seeing some unusual behavior in our two HDFS 2.6.0 (CDH 5.11.1) clusters and were wondering if you could help. When we fail over our NameNodes we observe a large number of pending-deletion blocks - i.e., the PendingDeletionBlocks metric is zero before failover and several thousand after.
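For context, here is roughly how we are watching the metrics across a failover - a minimal sketch that polls the FSNamesystem JMX bean on each NameNode. The hostnames are placeholders, and it assumes the default 50070 NameNode HTTP port and that the bean exposes PendingDeletionBlocks, PostponedMisreplicatedBlocks, NumStaleStorages, and tag.HAState:

#!/usr/bin/env python3
# Sketch: poll the FSNamesystem JMX bean on both NameNodes and print the
# metrics we are comparing across failover. Hostnames below are placeholders.
import json
import time
import urllib.request

NAMENODES = ["nn1.example.com", "nn2.example.com"]   # placeholder hosts
JMX_URL = "http://{0}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
METRICS = ["tag.HAState", "PendingDeletionBlocks",
           "PostponedMisreplicatedBlocks", "NumStaleStorages"]

def fsnamesystem_bean(host):
    """Fetch the FSNamesystem bean from one NameNode's /jmx servlet."""
    with urllib.request.urlopen(JMX_URL.format(host), timeout=10) as resp:
        beans = json.load(resp).get("beans", [])
    return beans[0] if beans else {}

while True:
    for host in NAMENODES:
        try:
            bean = fsnamesystem_bean(host)
            values = " ".join("%s=%s" % (m, bean.get(m)) for m in METRICS)
            print(time.strftime("%H:%M:%S"), host, values)
        except Exception as exc:
            print(time.strftime("%H:%M:%S"), host, "unreachable:", exc)
    time.sleep(30)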
This seems different from PostponedMisreplicatedBlocks [1], which we expect to be non-zero until all the DataNodes have sent their block reports to the new active NameNode and NumStaleStorages has returned to zero; that metric does drop to zero once all the block reports have been received. What we're seeing instead is that PendingDeletionBlocks jumps immediately after failover, while NumStaleStorages is still roughly equal to the number of DataNodes in the cluster.

The extra space these blocks occupy is a problem, as we have to increase our cluster size to accommodate them until the NameNodes are failed over. We've checked the debug logs, the metasave report, and the other JMX metrics, and everything appears fine before we fail over, apart from the amount of DFS used growing and then decreasing. We can't find anything obviously wrong with the HDFS configuration, HA setup, etc.

Any help on where to look/debug next would be appreciated.

Thanks,
Michael.

[1] https://github.com/cloudera/hadoop-common/blob/cdh5-2.6.0_5.11.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3047