[
https://issues.apache.org/jira/browse/YARN-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun Suresh updated YARN-5465:
------------------------------
Target Version/s: 3.1.0 (was: 2.9.0)
> Server-Side NM Graceful Decommissioning subsequent call behavior
> ----------------------------------------------------------------
>
> Key: YARN-5465
> URL: https://issues.apache.org/jira/browse/YARN-5465
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: graceful
> Reporter: Robert Kanter
>
> The Server-Side NM Graceful Decommissioning feature added by YARN-4676 has
> the following behavior when subsequent calls are made:
> # Start a long-running job that has containers running on nodeA
> # Add nodeA to the exclude file
> # Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully
> decommissioning nodeA
> # Wait 30 seconds
> # Add nodeB to the exclude file
> # Run {{-refreshNodes -g 30 -server}} (30sec)
> # After 30 seconds, both nodeA and nodeB shut down
> In a nutshell, issuing a subsequent call to gracefully decommission nodes
> updates the timeout for any currently decommissioning nodes. This makes it
> impossible to gracefully decommission different sets of nodes with different
> timeouts. Though it does let you easily update the timeout of currently
> decommissioning nodes.
> Another behavior we could do is this:
> # {color:grey}Start a long-running job that has containers running on nodeA
> # {color:grey}Add nodeA to the exclude file{color}
> # {color:grey}Run {{-refreshNodes -g 120 -server}} (2min) to begin gracefully
> decommissioning nodeA{color}
> # {color:grey}Wait 30 seconds{color}
> # {color:grey}Add nodeB to the exclude file{color}
> # {color:grey}Run {{-refreshNodes -g 30 -server}} (30sec){color}
> # After 30 seconds, nodeB shuts down
> # After 60 more seconds, nodeA shuts down
> This keeps the nodes affected by each call to gracefully decommission nodes
> independent. You can now have different sets of decommissioning nodes with
> different timeouts. However, to update the timeout of a currently
> decommissioning node, you'd have to first recommission it, and then
> decommission it again.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]