[ https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wangda Tan updated YARN-7790:
-----------------------------
Reporter: Sumana Sathish (was: Wangda Tan)
> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sumana Sathish
> Assignee: Wangda Tan
> Priority: Critical
> Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse.
> In sync scheduling, when an AM container is allocated to a node, the
> scheduler assumes the node has just heartbeated to the RM, and the container
> is sent back to the NM in the same heartbeat response. The NM can still crash
> after the heartbeat, which causes the AM to hang for 10 minutes, but this is
> relatively rare.
> In async scheduling, multiple AM containers can be placed on a problematic
> NM, which can cause applications to hang for a long time. After discussing
> with [~sunilg], we need at least two fixes when async scheduling is enabled:
> 1) Skip nodes that have missed X node heartbeats.
> 2) Kill AM containers in the ALLOCATED state on a node that has missed Y
> node heartbeats.
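
The two proposed fixes can be sketched roughly as below. This is only an
illustration of the idea, not the content of YARN-7790.001.patch: the class
HeartbeatAwarePlacement, its methods, the State enum, and the AmContainer type
are all hypothetical and are not YARN APIs. It also assumes that "missed N
heartbeats" means at least N full heartbeat intervals have elapsed since the
node's last heartbeat, with X and Y passed in as plain thresholds.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two fixes; none of these names exist in YARN. */
final class HeartbeatAwarePlacement {

    /** Hypothetical container states; only ALLOCATED matters for fix 2. */
    enum State { ALLOCATED, ACQUIRED, RUNNING }

    /** Hypothetical record of an AM container placed on a node. */
    static final class AmContainer {
        final String id;
        final State state;

        AmContainer(String id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    /**
     * Assumption: a node has "missed" one heartbeat for every full heartbeat
     * interval that has elapsed since its last heartbeat.
     */
    static long missedHeartbeats(long nowMs, long lastHeartbeatMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        long elapsedMs = Math.max(0L, nowMs - lastHeartbeatMs);
        return elapsedMs / intervalMs;
    }

    /** Fix 1: skip a node for allocation once it has missed at least x heartbeats. */
    static boolean shouldSkipNode(long missed, long x) {
        return missed >= x;
    }

    /**
     * Fix 2: once a node has missed at least y heartbeats, return the ids of
     * AM containers on it that are still in the ALLOCATED state.
     */
    static List<String> amContainersToKill(List<AmContainer> onNode, long missed, long y) {
        List<String> toKill = new ArrayList<>();
        if (missed < y) {
            return toKill;
        }
        for (AmContainer c : onNode) {
            if (c.state == State.ALLOCATED) {
                toKill.add(c.id);
            }
        }
        return toKill;
    }
}
```

Keeping X and Y as separate thresholds lets the scheduler stop placing new
containers on a quiet node early, while waiting longer before it kills AM
containers that were already allocated there.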