[
https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242634#comment-16242634
]
ASF GitHub Bot commented on YARN-6483:
--------------------------------------
GitHub user juanrh opened a pull request:
https://github.com/apache/hadoop/pull/289
[YARN-6483] Add nodes transitioning to DECOMMISSIONING state to the list of
updated nodes returned by the Resource Manager as a response to the Application
Master heartbeat
This is an alternative approach to
https://issues.apache.org/jira/browse/YARN-3224 for notifying all affected
application masters when a node transitions into the DECOMMISSIONING state.
This change modifies the AllocateResponse that the YARN Resource Manager sends
in reply to heartbeat requests from application masters: any node that has
transitioned to DECOMMISSIONING state since the last heartbeat is added to the
list of NodeReport objects carried in the AllocateResponse. We also add a new
field to each NodeReport that holds the decommission timeout for
DECOMMISSIONING nodes, thus covering the same functionality as the original
proposal in YARN-3224.
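As a rough illustration of how an application master might consume the updated
node list described above, here is a minimal, self-contained sketch. It does
not use the Hadoop classes: the NodeState and NodeReport types below are
simplified stand-ins, and the field name decommissioningTimeoutSecs, the class
name DecommissioningNodeFilter and the host names are assumptions made for the
example, not the API added by this pull request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DecommissioningNodeFilter {

    // Simplified stand-in for YARN's NodeState; only the states used here.
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    // Hypothetical stand-in for a NodeReport carrying a decommission timeout.
    static final class NodeReport {
        final String nodeId;
        final NodeState state;
        final Integer decommissioningTimeoutSecs; // null when not decommissioning

        NodeReport(String nodeId, NodeState state, Integer timeoutSecs) {
            this.nodeId = nodeId;
            this.state = state;
            this.decommissioningTimeoutSecs = timeoutSecs;
        }
    }

    // Returns the reports for nodes that entered DECOMMISSIONING, i.e. the
    // nodes an application master may want to stop scheduling work on.
    static List<NodeReport> decommissioningNodes(List<NodeReport> updatedNodes) {
        List<NodeReport> result = new ArrayList<>();
        for (NodeReport report : updatedNodes) {
            if (report.state == NodeState.DECOMMISSIONING) {
                result.add(report);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pretend this list came from the updated nodes of a heartbeat response.
        List<NodeReport> updated = Arrays.asList(
            new NodeReport("host1:8041", NodeState.RUNNING, null),
            new NodeReport("host2:8041", NodeState.DECOMMISSIONING, 3600));

        for (NodeReport report : decommissioningNodes(updated)) {
            System.out.println(report.nodeId + " is decommissioning, timeout "
                + report.decommissioningTimeoutSecs + "s");
        }
    }
}
```

In a real application master the list would come from the AllocateResponse
returned on each heartbeat, and the filtered nodes could be passed to the
application's own blacklisting logic.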
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/juanrh/hadoop hortala-YARN-6483
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/hadoop/pull/289.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #289
----
commit c85b7012b90fff2cf323009664b14db1e998408a
Author: Juan Rodriguez Hortala <[email protected]>
Date: 2017-10-27T23:59:57Z
Notify affected Application Masters when a node enters DECOMMISSIONING state
commit ed2561e3bfb821f6b981ac5e76761c7ae8475e01
Author: Juan Rodriguez Hortala <[email protected]>
Date: 2017-11-02T00:41:27Z
Add decommission timeout field to NodeReport
commit 506f7defbf217c6b8b525a90604e2484573bba8d
Author: Juan Rodriguez Hortala <[email protected]>
Date: 2017-11-02T01:19:49Z
fix TestClientRMService
adapt test to decommission timeout checks being independent
of received heartbeats
commit 9a82f314feca6e06e067ceb1b64e0e4e4fad3882
Author: Juan Rodriguez Hortala <[email protected]>
Date: 2017-11-02T18:07:29Z
use xml format for excludes files with timeouts
in this version that is the only way to specify a timeout in the
excludes file
commit e17bc8c132ae3fb70fe273e6bee12956142b5345
Author: Juan Rodriguez Hortala <[email protected]>
Date: 2017-11-02T20:41:53Z
load dynamic timeout like in hadoop trunk
replace dynamic conf by using the configuration
passed by AdminService
cr https://cr.amazon.com/r/7919988/
----
> Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes
> returned by the Resource Manager as a response to the Application Master
> heartbeat
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6483
> URL: https://issues.apache.org/jira/browse/YARN-6483
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Juan Rodríguez Hortalá
> Attachments: YARN-6483-v1.patch
>
>
> The DECOMMISSIONING node state is currently used as part of the graceful
> decommissioning mechanism to give time for tasks to complete in a node that
> is scheduled for decommission, and for reducer tasks to read the shuffle
> blocks in that node. Also, YARN effectively blacklists nodes in
> DECOMMISSIONING state by assigning them a capacity of 0, to prevent
> additional containers from being launched on those nodes, so no more shuffle
> blocks are written to the node. This blacklisting is not effective for
> applications like Spark, because a Spark executor running in a YARN container
> will keep receiving more tasks after the corresponding node has been
> blacklisted at the YARN level. We would like to propose a modification of the
> YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added
> to the list of updated nodes returned by the Resource Manager as a response
> to the Application Master heartbeat. This way a Spark application master
> would be able to blacklist a DECOMMISSIONING node at the Spark level.