[ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-7790:
-----------------------------
    Attachment: YARN-7790.001.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse.
> In sync scheduling, when an AM container is allocated to a node, the
> scheduler can assume the node has just sent a heartbeat to the RM, and the
> allocation is sent back to the NM in the same response. It is still possible
> for the NM to crash right after the heartbeat, which leaves the AM hanging
> for 10 minutes, but this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a
> problematic NM, which can cause applications to hang for a long time. After
> discussing with [~sunilg], we need at least two fixes.
> When async scheduling is enabled:
> 1) Skip a node that has missed X node heartbeats.
> 2) Kill any AM container in the ALLOCATED state on a node that has missed Y
> node heartbeats.
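
The two proposed checks can be sketched roughly as follows. This is an
illustration only, not the attached patch: the class HeartbeatGuard, its
methods, the Container type, and the threshold values standing in for X and Y
are all hypothetical and do not come from the YARN code base. The number of
missed heartbeats is approximated as the time since the last heartbeat divided
by an assumed heartbeat interval.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are taken from YARN or the patch.
class HeartbeatGuard {
    // Assumed stand-ins for the "X" and "Y" thresholds in the description.
    static final long SKIP_AFTER_MISSED = 2;
    static final long KILL_AFTER_MISSED = 5;

    enum ContainerState { ALLOCATED, RUNNING }

    static final class Container {
        final String id;
        final boolean isAm;
        final ContainerState state;

        Container(String id, boolean isAm, ContainerState state) {
            this.id = id;
            this.isAm = isAm;
            this.state = state;
        }
    }

    // Approximates missed heartbeats as whole intervals since the last one.
    static long missedHeartbeats(long nowMs, long lastHeartbeatMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        long elapsedMs = Math.max(0L, nowMs - lastHeartbeatMs);
        return elapsedMs / intervalMs;
    }

    // Fix 1: do not place new containers on a node that has gone quiet.
    static boolean shouldSkipNode(long missed) {
        return missed >= SKIP_AFTER_MISSED;
    }

    // Fix 2: pick AM containers still in ALLOCATED state on a quiet node.
    static List<String> amContainersToKill(long missed, List<Container> containers) {
        List<String> toKill = new ArrayList<>();
        if (missed < KILL_AFTER_MISSED) {
            return toKill;
        }
        for (Container c : containers) {
            if (c.isAm && c.state == ContainerState.ALLOCATED) {
                toKill.add(c.id);
            }
        }
        return toKill;
    }
}
```

In this sketch the kill threshold is set higher than the skip threshold, so a
node is first excluded from new allocations and only later has its pending AM
containers released. That ordering is an assumption, not something the issue
states.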



