[
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404463#comment-15404463
]
Daniel Zhi commented on YARN-4676:
----------------------------------
I can clarify the scenarios:
1. DECOMMISSIONED->RUNNING: this happens due to the RECOMMISSION event, which is
triggered when the node is removed from the exclude file (nodes can be dynamically
excluded or included). In the typical EMR cluster scenario, a daemon like the NM
is configured to auto-start if killed or shut down; however, the RM will reject
such an NM if it appears in the exclude list.
2. Related to 1, a DECOMMISSIONED NM, upon auto-restart, will try to register
with the RM but will be rejected. It continues this loop until either: 1) the
host is terminated; or 2) the host is recommissioned. The
DECOMMISSIONED->LOST transition was likely defensive coding --- without it, an
invalid event exception is thrown.
3. CLEANUP_CONTAINER and CLEANUP_APP were certainly added to prevent an
otherwise invalid event exception in the DECOMMISSIONED state.
So the core reason for these transitions is that, in the EMR scenario,
DECOMMISSIONED NMs are "active standby" (and could be RECOMMISSIONed without
delay at any moment) until their hosts are terminated.
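The transitions in scenarios 1-3 can be sketched as a minimal state machine. This is an illustrative model only, not YARN's actual RMNodeImpl state machine; the enum and method names below are invented for the sketch.

```java
// Sketch of the transitions discussed above, out of the DECOMMISSIONED state.
// NodeState/NodeEvent/handle are hypothetical names, not YARN's real API.
public class DecommissionedStateSketch {
  enum NodeState { RUNNING, DECOMMISSIONED, LOST }
  enum NodeEvent { RECOMMISSION, EXPIRE, CLEANUP_CONTAINER, CLEANUP_APP }

  static NodeState handle(NodeState state, NodeEvent event) {
    if (state != NodeState.DECOMMISSIONED) {
      throw new IllegalStateException("sketch only models DECOMMISSIONED");
    }
    switch (event) {
      case RECOMMISSION:
        // Node was removed from the exclude file: schedulable again (scenario 1).
        return NodeState.RUNNING;
      case EXPIRE:
        // Defensive DECOMMISSIONED->LOST transition (scenario 2).
        return NodeState.LOST;
      case CLEANUP_CONTAINER:
      case CLEANUP_APP:
        // Registered as no-ops so late cleanup events do not raise an
        // invalid event exception (scenario 3).
        return NodeState.DECOMMISSIONED;
      default:
        throw new IllegalArgumentException("unhandled: " + event);
    }
  }

  public static void main(String[] args) {
    System.out.println(handle(NodeState.DECOMMISSIONED, NodeEvent.RECOMMISSION));
    System.out.println(handle(NodeState.DECOMMISSIONED, NodeEvent.CLEANUP_APP));
  }
}
```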
> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
> Key: YARN-4676
> URL: https://issues.apache.org/jira/browse/YARN-4676
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Zhi
> Assignee: Daniel Zhi
> Labels: features
> Attachments: GracefulDecommissionYarnNode.pdf,
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch,
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch,
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch,
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch,
> YARN-4676.015.patch, YARN-4676.016.patch, YARN-4676.017.patch,
> YARN-4676.018.patch, YARN-4676.019.patch
>
>
> YARN-4676 implements an automatic, asynchronous and flexible mechanism to
> gracefully decommission
> YARN nodes. After the user issues the refreshNodes request, the ResourceManager
> automatically evaluates the
> status of all affected nodes and kicks off decommission or recommission
> actions. The RM asynchronously
> tracks container and application status related to DECOMMISSIONING nodes to
> decommission the
> nodes immediately after they are ready to be decommissioned. Decommissioning
> timeouts at individual-node
> granularity are supported and can be dynamically updated. The
> mechanism naturally supports multiple
> independent graceful decommissioning "sessions", where each one involves a
> different set of nodes with
> different timeout settings. Such support is ideal and necessary for graceful
> decommission requests issued
> by external cluster management software instead of a human.
> DecommissioningNodeWatcher inside ResourceTrackerService tracks
> DECOMMISSIONING node status automatically and asynchronously after a
> client/admin makes the graceful decommission request. It tracks
> DECOMMISSIONING node status to decide when a node, after all running
> containers on it have completed, will be transitioned into the DECOMMISSIONED
> state. NodesListManager detects and handles include and exclude list changes
> to kick off decommission or recommission as necessary.
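The readiness decision described in the quoted text (decommission once running containers drain, or once the per-node timeout expires) can be sketched as below. The method name, parameters, and the convention that a negative timeout means "wait indefinitely" are assumptions for illustration, not YARN's actual API.

```java
// Hypothetical sketch of the check a watcher like DecommissioningNodeWatcher
// performs for each DECOMMISSIONING node on every heartbeat.
public class DecommissionReadinessSketch {
  // timeoutSeconds < 0 is taken to mean "no timeout" (assumption).
  static boolean readyToDecommission(int runningContainers,
                                     long waitedSeconds,
                                     long timeoutSeconds) {
    if (runningContainers == 0) {
      return true;  // all containers completed: decommission immediately
    }
    // Otherwise decommission only once the per-node timeout has elapsed.
    return timeoutSeconds >= 0 && waitedSeconds >= timeoutSeconds;
  }

  public static void main(String[] args) {
    System.out.println(readyToDecommission(0, 10, 3600));   // drained -> true
    System.out.println(readyToDecommission(3, 10, 3600));   // still waiting -> false
    System.out.println(readyToDecommission(3, 3600, 3600)); // timed out -> true
  }
}
```

Per-node timeouts, rather than one global value, are what allow the independent "sessions" mentioned above to coexist with different settings.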
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)