[
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326651#comment-15326651
]
Daniel Zhi commented on YARN-4676:
----------------------------------
I was OOO and then busy on another project, so I only got time today to refresh
the patch. The previous patch, YARN-4676.014.patch, already tried to address
all of Varun Vasudev's comments except for logical conflicts with YARN-4311.
The new patch, YARN-4676.015.patch, resolves those conflicts. For Robert
Kanter's latest comments:
1. The conflict with YARN-4311 turned out not to be too bad --- I needed to
update the tests introduced by YARN-4311 to be compatible with the graceful
decommission behavior as of YARN-4676. This fixed the TestResourceTrackerService
failures.
2. I am not sure of the context of this point. The "earlier comments" link
leads to comment No. 22, which is about a separate timer for the poll; that was
addressed by the previous patch (see the sketch after this list for what such a
timer could look like).
3. Done.
4. YARN-4676.014.patch on May 7 already removed the delayed-shutdown logic in
NodeManager as part of addressing Varun Vasudev's comments. We will maintain a
private patch in EMR for the delayed shutdown, or may propose it as a feature
in a new JIRA.
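
For point 2, below is a minimal sketch of what a dedicated poll timer could
look like: a separate daemon timer that periodically re-evaluates
DECOMMISSIONING nodes instead of piggybacking on another timer. All class and
method names here are hypothetical illustrations, not the actual patch code.

    // Minimal sketch of a dedicated poll timer for re-evaluating
    // DECOMMISSIONING nodes, kept separate from other RM timers. All
    // names here are hypothetical illustrations, not the patch code.
    import java.util.Timer;
    import java.util.TimerTask;

    public class DecommissionPollTimer {
      // Daemon thread so the timer never blocks process shutdown.
      private final Timer timer = new Timer("DecommissionPoll", true);
      private final long pollIntervalMs;
      private final Runnable pollAction;

      public DecommissionPollTimer(long pollIntervalMs, Runnable pollAction) {
        this.pollIntervalMs = pollIntervalMs;
        this.pollAction = pollAction;
      }

      public void start() {
        timer.scheduleAtFixedRate(new TimerTask() {
          @Override
          public void run() {
            pollAction.run(); // e.g. re-check all DECOMMISSIONING nodes
          }
        }, pollIntervalMs, pollIntervalMs);
      }

      public void stop() {
        timer.cancel();
      }
    }

Using a daemon thread keeps the poll from holding the process open, and a
fixed-rate schedule keeps the evaluation cadence independent of other timers.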
> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
> Key: YARN-4676
> URL: https://issues.apache.org/jira/browse/YARN-4676
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Zhi
> Assignee: Daniel Zhi
> Labels: features
> Attachments: GracefulDecommissionYarnNode.pdf,
> GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch,
> YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch,
> YARN-4676.009.patch, YARN-4676.010.patch, YARN-4676.011.patch,
> YARN-4676.012.patch, YARN-4676.013.patch, YARN-4676.014.patch
>
>
> YARN-4676 implements an automatic, asynchronous, and flexible mechanism to
> gracefully decommission YARN nodes. After the user issues a refreshNodes
> request, the ResourceManager automatically evaluates the status of all
> affected nodes and kicks off decommission or recommission actions as needed.
> The RM asynchronously tracks container and application status related to
> DECOMMISSIONING nodes so that it can decommission each node immediately once
> it is ready to be decommissioned. Decommissioning timeouts at individual-node
> granularity are supported and can be dynamically updated. The mechanism
> naturally supports multiple independent graceful decommissioning "sessions",
> where each session involves a different set of nodes with different timeout
> settings. Such support is ideal, and necessary, for graceful decommission
> requests issued by external cluster management software rather than by a
> human.
> DecommissioningNodeWatcher, inside ResourceTrackerService, tracks the status
> of DECOMMISSIONING nodes automatically and asynchronously after the
> client/admin makes the graceful decommission request. It tracks each
> DECOMMISSIONING node's status to decide when the node, once all running
> containers on it have completed, will be transitioned into the DECOMMISSIONED
> state. NodesListManager detects and handles include and exclude list changes
> to kick off decommission or recommission as necessary.
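
To make the tracking flow described above concrete, here is a minimal sketch
of the watcher's per-poll evaluation. The states DECOMMISSIONING and
DECOMMISSIONED come from the description above; every other type, field, and
name is a hypothetical illustration, not the actual DecommissioningNodeWatcher
code.

    // Simplified sketch of per-poll tracking: a DECOMMISSIONING node
    // becomes DECOMMISSIONED once all its containers have completed or
    // its per-node timeout expires. Types and fields are hypothetical
    // illustrations, not the actual DecommissioningNodeWatcher code.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DecommissioningWatcherSketch {

      enum NodeState { DECOMMISSIONING, DECOMMISSIONED }

      static class TrackedNode {
        NodeState state = NodeState.DECOMMISSIONING;
        int runningContainers; // updated from node heartbeats
        long deadlineMillis;   // this node's decommission timeout
      }

      private final Map<String, TrackedNode> nodes = new ConcurrentHashMap<>();

      // Called on each poll tick to re-evaluate every tracked node.
      void pollOnce(long nowMillis) {
        for (TrackedNode node : nodes.values()) {
          if (node.state != NodeState.DECOMMISSIONING) {
            continue;
          }
          boolean drained = node.runningContainers == 0;
          boolean timedOut = nowMillis >= node.deadlineMillis;
          if (drained || timedOut) {
            // In the RM this would dispatch a decommission event.
            node.state = NodeState.DECOMMISSIONED;
          }
        }
      }
    }

Because the deadline is stored per node, independent refreshNodes "sessions"
with different timeout settings coexist naturally, which is the multi-session
behavior the description calls out.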