[ 
https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242634#comment-16242634
 ] 

ASF GitHub Bot commented on YARN-6483:
--------------------------------------

GitHub user juanrh opened a pull request:

    https://github.com/apache/hadoop/pull/289

    [YARN-6483] Add nodes transitioning to DECOMMISSIONING state to the list of 
updated nodes returned by the Resource Manager as a response to the Application 
Master heartbeat

    This is an alternative approach to 
https://issues.apache.org/jira/browse/YARN-3224 for notifying all affected 
application masters when a node transitions into the DECOMMISSIONING state. 
This change modifies the AllocateResponse that the YARN Resource Manager sends 
in reply to heartbeat requests from application masters: any node that has 
transitioned to the DECOMMISSIONING state since the last heartbeat is added to 
the list of NodeReport objects carried by the AllocateResponse. We also add a 
new field to each NodeReport that holds the decommission timeout for 
DECOMMISSIONING nodes, thus covering the same functionality as the original 
proposal in YARN-3224.
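
As a rough illustration of how an application master might consume such a 
response, here is a minimal, self-contained sketch. It does not use the Hadoop 
classes: NodeState and NodeReport below are simplified stand-ins, and the 
decommissionTimeoutSecs field and the hostsToAvoid helper are hypothetical names 
chosen for this example, not the names used in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Self-contained sketch: these types are simplified stand-ins, not the Hadoop classes.
class DecommissionSketch {
    // Subset of node states relevant to this example.
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    // Stand-in for a node report; decommissionTimeoutSecs is a hypothetical field name.
    static final class NodeReport {
        final String host;
        final NodeState state;
        final Integer decommissionTimeoutSecs; // null unless the node is DECOMMISSIONING

        NodeReport(String host, NodeState state, Integer decommissionTimeoutSecs) {
            this.host = host;
            this.state = state;
            this.decommissionTimeoutSecs = decommissionTimeoutSecs;
        }
    }

    // Given the updated-node list from one heartbeat response, returns the hosts
    // the application should stop scheduling new work on.
    static List<String> hostsToAvoid(List<NodeReport> updatedNodes) {
        List<String> hosts = new ArrayList<>();
        for (NodeReport report : updatedNodes) {
            if (report.state == NodeState.DECOMMISSIONING) {
                hosts.add(report.host);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<NodeReport> updated = Arrays.asList(
            new NodeReport("host-a", NodeState.RUNNING, null),
            new NodeReport("host-b", NodeState.DECOMMISSIONING, 3600));
        System.out.println(hostsToAvoid(updated));
    }
}
```

In a real application master the list would come from the updated nodes of the 
AllocateResponse rather than being built by hand, and the timeout could be used 
to decide how aggressively to migrate work off the node.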

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/juanrh/hadoop hortala-YARN-6483

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hadoop/pull/289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #289
    
----
commit c85b7012b90fff2cf323009664b14db1e998408a
Author: Juan Rodriguez Hortala <[email protected]>
Date:   2017-10-27T23:59:57Z

    Notify affected Application Masters when a node enters DECOMMISSIONING state

commit ed2561e3bfb821f6b981ac5e76761c7ae8475e01
Author: Juan Rodriguez Hortala <[email protected]>
Date:   2017-11-02T00:41:27Z

    Add decommission timeout field to NodeReport

commit 506f7defbf217c6b8b525a90604e2484573bba8d
Author: Juan Rodriguez Hortala <[email protected]>
Date:   2017-11-02T01:19:49Z

    fix TestClientRMService
    
    adapt test to decommission timeout checks being independent
    of received heartbeats

commit 9a82f314feca6e06e067ceb1b64e0e4e4fad3882
Author: Juan Rodriguez Hortala <[email protected]>
Date:   2017-11-02T18:07:29Z

    use xml format for excludes files with timeouts
    
    in this version that is the only way to specify a timeout in the
    excludes file

commit e17bc8c132ae3fb70fe273e6bee12956142b5345
Author: Juan Rodriguez Hortala <[email protected]>
Date:   2017-11-02T20:41:53Z

    load dynamic timeout like in hadoop trunk
    
    replace dynamic conf by using the configuration
    passed by AdminService
    
    cr https://cr.amazon.com/r/7919988/

----


> Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes 
> returned by the Resource Manager as a response to the Application Master 
> heartbeat
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6483
>                 URL: https://issues.apache.org/jira/browse/YARN-6483
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Juan Rodríguez Hortalá
>         Attachments: YARN-6483-v1.patch
>
>
> The DECOMMISSIONING node state is currently used as part of the graceful 
> decommissioning mechanism to give tasks time to complete on a node that 
> is scheduled for decommission, and to let reducer tasks read the shuffle 
> blocks on that node. Also, YARN effectively blacklists nodes in the 
> DECOMMISSIONING state by assigning them a capacity of 0, which prevents 
> additional containers from being launched on those nodes, so no more shuffle 
> blocks are written to the node. This blacklisting is not effective for 
> applications like Spark, because a Spark executor running in a YARN container 
> will keep receiving more tasks after the corresponding node has been 
> blacklisted at the YARN level. We would like to propose a modification of the 
> YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added 
> to the list of updated nodes returned by the Resource Manager as a response 
> to the Application Master heartbeat. This way a Spark application master 
> would be able to blacklist a DECOMMISSIONING node at the Spark level.
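
To make the application-level blacklisting idea concrete, the following is a 
minimal sketch of a task placer that stops handing tasks to executors on hosts 
reported as DECOMMISSIONING. It is not Spark code: HostBlacklistSketch and all 
of its methods are hypothetical names invented for this illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical application-level scheduler state; not part of Spark or YARN.
class HostBlacklistSketch {
    // Executor id -> host, kept in registration order.
    private final Map<String, String> executorToHost = new LinkedHashMap<>();
    // Hosts that must not receive new tasks.
    private final Set<String> blacklistedHosts = new HashSet<>();

    void registerExecutor(String executorId, String host) {
        executorToHost.put(executorId, host);
    }

    // Called when a heartbeat response reports the host as DECOMMISSIONING.
    void onNodeDecommissioning(String host) {
        blacklistedHosts.add(host);
    }

    // Picks the first registered executor whose host is not blacklisted.
    // The container on a blacklisted host keeps running; it just gets no new tasks.
    Optional<String> pickExecutor() {
        return executorToHost.entrySet().stream()
            .filter(e -> !blacklistedHosts.contains(e.getValue()))
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```

The point of the sketch is that the zero-capacity mechanism only stops new 
containers, whereas the check in pickExecutor stops new tasks inside containers 
that already exist.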



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
