[jira] [Commented] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2022-01-24 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481354#comment-17481354
 ] 

Kevin Wikant commented on HDFS-16303:
--------------------------------------

My apologies for the delay; something high priority came up at work.
 * backport to Hadoop 2.x: https://github.com/apache/hadoop/pull/3920
 * backport to Hadoop 3.x: https://github.com/apache/hadoop/pull/3921

I have also made one small change to the DatanodeAdminDefaultMonitor based on a 
rare edge case I identified: https://github.com/apache/hadoop/pull/3923

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> -----------------------------------------------------------------------------
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For 
> example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X 
> of those datanodes remain in state decommissioning forever without making any 
> forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for 
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
>  Default Value = 100
> The maximum number of decommission-in-progress datanodes that will be 
> tracked at one time by the namenode. Tracking a decommission-in-progress 
> datanode consumes additional NN memory proportional to the number of blocks 
> on the datanode. Having a conservative limit reduces the potential impact of 
> decommissioning a large number of nodes at once. A value of 0 means no limit 
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning 
> at any given time, so as to avoid Namenode memory pressure.
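> As a minimal illustration, this is how that limit can be read in Java (the 
> DFSConfigKeys constants are the real ones; the wrapper class is invented for 
> the sketch):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.DFSConfigKeys;
> import org.apache.hadoop.hdfs.HdfsConfiguration;
>
> // Sketch: read the tracked-nodes limit from hdfs-site.xml, falling back
> // to the shipped default of 100 when the key is unset.
> class ReadTrackedLimit {
>   public static void main(String[] args) {
>     Configuration conf = new HdfsConfiguration(); // loads hdfs-site.xml
>     int maxTracked = conf.getInt(
>         DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES,
>         DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES_DEFAULT);
>     System.out.println("max tracked decommissioning datanodes = " + maxTracked);
>   }
> }
> {code}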
> Looking into the "DatanodeAdminManager" code:
>  * a datanode is only removed from the "tracked.nodes" set when it 
> finishes decommissioning
>  * a datanode is only added to the "tracked.nodes" set if fewer than 
> 100 datanodes are currently being tracked (see the sketch below)
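> For illustration only, here is a toy model of that admission logic (class and 
> field names are invented for the sketch, not the actual Hadoop identifiers):
> {code:java}
> import java.util.ArrayDeque;
> import java.util.HashSet;
> import java.util.Queue;
> import java.util.Set;
>
> // Toy model of the behavior described above: datanodes wait in a pending
> // queue and are admitted only while fewer than 100 nodes are tracked.
> class TrackedNodesSketch {
>   final Queue<String> pendingNodes = new ArrayDeque<>();
>   final Set<String> trackedNodes = new HashSet<>();
>   static final int MAX_TRACKED = 100;
>
>   // Called periodically by the monitor: fill any free tracking slots.
>   void admitPendingNodes() {
>     while (trackedNodes.size() < MAX_TRACKED && !pendingNodes.isEmpty()) {
>       trackedNodes.add(pendingNodes.poll());
>     }
>   }
>
>   // The only way a node leaves trackedNodes is by finishing, so a node
>   // that can never finish holds its tracking slot forever.
>   void onDecommissionFinished(String node) {
>     trackedNodes.remove(node);
>   }
> }
> {code}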
> So in the event that there are more than 100 datanodes being decommissioned 
> at a given time, some of those datanodes will not be in the "tracked.nodes" 
> set until one or more datanodes in "tracked.nodes" finish 
> decommissioning. This is generally not a problem because the datanodes in 
> "tracked.nodes" will eventually finish decommissioning, but there is an edge 
> case where this logic prevents the namenode from making any forward progress 
> towards decommissioning.
> If all 100 datanodes in the "tracked.nodes" set are unable to finish 
> decommissioning, then other datanodes (which may be able to be 
> decommissioned) will never get added to "tracked.nodes" and therefore will 
> never get the opportunity to be decommissioned.
> This can occur due to the following issue:
> {quote}2021-10-21 12:39:24,048 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
> Progress. Cannot be safely decommissioned or be in maintenance since there is 
> risk of reduced data durability or data loss. Either restart the failed node 
> or force decommissioning or maintenance by removing, calling refreshNodes, 
> then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying 
> hardware fails or is lost), then it will remain in state decommissioning 
> forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop 
> cluster lifetime, then this is enough to completely fill up the 
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with 
> datanodes that can never finish decommissioning, any datanodes added after 
> this point will never be able to be decommissioned because they will never be 
> added to the "tracked.nodes" set.
> In this scenario:
>  * the "tracked.nodes" set is filled with datanodes which are lost & cannot 
> be recovered (and can never finish decommissioning so they will never be 
> removed from the set)
>  * the actual live datanodes being decommissioned are enqueued waiting to 
> enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will 
> be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance 
> by removing, calling refreshNodes, then re-adding to the excludes or host 
> config files.
> {quote}
> (In practice, "calling refreshNodes" means running "hdfs dfsadmin 
> -refreshNodes" after editing the excludes/hosts files.)
> Ideally, the Namenode should be able to gracefully handle scenarios where the 
> datanodes in the "tracked.nodes" set are not making forward progress towards 
> decommissioning.
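> Purely as an illustrative sketch of that idea (extending the toy model above; 
> this is not the actual patch in the linked PRs): untracking dead nodes and 
> re-queueing them would let the queued live datanodes make progress:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.HashSet;
> import java.util.Iterator;
> import java.util.Queue;
> import java.util.Set;
> import java.util.function.Predicate;
>
> // Toy model only: a dead node gives up its tracking slot and moves to the
> // back of the pending queue, so waiting live nodes are admitted first.
> class UntrackDeadNodesSketch {
>   final Queue<String> pendingNodes = new ArrayDeque<>();
>   final Set<String> trackedNodes = new HashSet<>();
>   static final int MAX_TRACKED = 100;
>   final Predicate<String> isDead; // liveness check, injected for the sketch
>
>   UntrackDeadNodesSketch(Predicate<String> isDead) {
>     this.isDead = isDead;
>   }
>
>   void monitorTick() {
>     // Free slots held by dead nodes; re-queue them so they are
>     // re-examined if they ever rejoin the cluster.
>     for (Iterator<String> it = trackedNodes.iterator(); it.hasNext();) {
>       String node = it.next();
>       if (isDead.test(node)) {
>         it.remove();
>         pendingNodes.add(node);
>       }
>     }
>     // Admit from the front of the queue into the freed slots.
>     while (trackedNodes.size() < MAX_TRACKED && !pendingNodes.isEmpty()) {
>       trackedNodes.add(pendingNodes.poll());
>     }
>   }
> }
> {code}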

[jira] [Commented] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2022-01-24 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480951#comment-17480951
 ] 

Akira Ajisaka commented on HDFS-16303:
---------------------------------------

Hi [~KevinWikant], how is this issue going?

[jira] [Commented] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-23 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464588#comment-17464588
 ] 

Akira Ajisaka commented on HDFS-16303:
---------------------------------------

Thank you for your contribution, [~KevinWikant]. Would you provide PRs for 
branch-3.3 and branch-3.2? The commit cannot be cherry-picked cleanly.
