[ https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wangda Tan updated YARN-7790:
-----------------------------
    Attachment: YARN-7790.001.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, if an AM container is allocated to a node, the scheduler
> can assume the node has just heartbeated to the RM, and the container is sent
> back to the NM in that same heartbeat response. It is still possible for the
> NM to crash right after the heartbeat, which causes the AM to hang for 10
> minutes, but that is relatively rare.
> With async scheduling, multiple AM containers can be placed on a problematic
> NM, which can cause applications to hang for a long time. After discussing
> with [~sunilg], we need at least two fixes.
> When async scheduling is enabled:
> 1) Skip any node that has missed X node heartbeats.
> 2) Kill any AM container in ALLOCATED state on a node that has missed Y node
> heartbeats.
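The two fixes proposed in the description (skip a node that has missed X heartbeats, and kill an ALLOCATED AM container on a node that has missed Y heartbeats) could be sketched roughly as below. This is only an illustration under stated assumptions, not the content of YARN-7790.001.patch: the class `StaleNodePolicy`, its methods, and the way missed heartbeats are estimated (elapsed time divided by the expected heartbeat interval) are all hypothetical and are not part of any YARN API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch only: not the YARN-7790 patch and not a YARN API.
 * Tracks the last heartbeat time per node and answers the two questions
 * raised in the issue description.
 */
final class StaleNodePolicy {
    private final long heartbeatIntervalMs; // assumed expected NM heartbeat interval
    private final int skipAfterMissed;      // "X" in the issue description
    private final int killAmAfterMissed;    // "Y" in the issue description
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    StaleNodePolicy(long heartbeatIntervalMs, int skipAfterMissed, int killAmAfterMissed) {
        if (heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("heartbeatIntervalMs must be positive");
        }
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.skipAfterMissed = skipAfterMissed;
        this.killAmAfterMissed = killAmAfterMissed;
    }

    /** Record that a node heartbeated at the given time. */
    void recordHeartbeat(String nodeId, long nowMs) {
        lastHeartbeatMs.put(nodeId, nowMs);
    }

    /** Estimate missed heartbeats as elapsed time divided by the interval. */
    long missedHeartbeats(String nodeId, long nowMs) {
        Long last = lastHeartbeatMs.get(nodeId);
        if (last == null) {
            return Long.MAX_VALUE; // never heard from this node: treat as stale
        }
        return Math.max(0L, (nowMs - last) / heartbeatIntervalMs);
    }

    /** Fix 1: the async scheduler should not place containers on this node. */
    boolean shouldSkipNode(String nodeId, long nowMs) {
        return missedHeartbeats(nodeId, nowMs) >= skipAfterMissed;
    }

    /** Fix 2: an AM container still in ALLOCATED state here should be killed. */
    boolean shouldKillAllocatedAm(String nodeId, long nowMs) {
        return missedHeartbeats(nodeId, nowMs) >= killAmAfterMissed;
    }
}
```

In this sketch X would normally be smaller than Y, so a node first stops receiving new allocations and only later has its pending AM containers killed. How the real scheduler would obtain heartbeat timestamps and where these checks would run is left open here.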