Wangda Tan created YARN-7790:
--------------------------------

             Summary: Improve Capacity Scheduler Async Scheduling to better handle node failures
                 Key: YARN-7790
                 URL: https://issues.apache.org/jira/browse/YARN-7790
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Wangda Tan
            Assignee: Wangda Tan


This is not a new issue, but async scheduling makes it worse:

In sync scheduling, when an AM container is allocated to a node, the node is 
assumed to have just sent a heartbeat to the RM, and the container is sent back 
to the NM in the same response. It is still possible for the NM to crash right 
after the heartbeat, which causes the AM to hang for 10 mins, but that is 
relatively rare.

In the async scheduling world, multiple AM containers can be placed on a 
problematic NM, which can cause applications to hang for a long time. After 
discussing with [~sunilg], we need at least two fixes:

When async scheduling is enabled:
1) Skip nodes that have missed X node heartbeats.
2) Kill AM containers in ALLOCATED state on nodes that have missed Y node 
   heartbeats.
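
The two fixes above could be sketched roughly as follows. This is only an 
illustration, not a patch: StaleNodeSketch, Node, AmContainer, 
missedHeartbeats, shouldSkipNode and amContainersToKill are hypothetical names 
that do not exist in YARN, and the sketch assumes missed heartbeats can be 
approximated as whole heartbeat intervals elapsed since the RM last heard from 
the NM. X and Y are left as parameters because no values have been decided.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types or methods exist in YARN.
final class StaleNodeSketch {
    enum ContainerState { ALLOCATED, RUNNING }

    static final class Node {
        final String id;
        final long lastHeartbeatMs; // time the RM last heard from this NM

        Node(String id, long lastHeartbeatMs) {
            this.id = id;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    static final class AmContainer {
        final String id;
        final Node node;
        final ContainerState state;

        AmContainer(String id, Node node, ContainerState state) {
            this.id = id;
            this.node = node;
            this.state = state;
        }
    }

    // Assumption: missed heartbeats ~= whole intervals elapsed since the last one.
    static long missedHeartbeats(Node node, long nowMs, long intervalMs) {
        return Math.max(0L, (nowMs - node.lastHeartbeatMs) / intervalMs);
    }

    // Fix 1: the async scheduler skips a node that has missed at least x heartbeats.
    static boolean shouldSkipNode(Node node, long nowMs, long intervalMs, long x) {
        return missedHeartbeats(node, nowMs, intervalMs) >= x;
    }

    // Fix 2: collect AM containers still in ALLOCATED state on nodes that
    // have missed at least y heartbeats, so the caller can kill them.
    static List<AmContainer> amContainersToKill(
            List<AmContainer> containers, long nowMs, long intervalMs, long y) {
        List<AmContainer> toKill = new ArrayList<>();
        for (AmContainer c : containers) {
            if (c.state == ContainerState.ALLOCATED
                    && missedHeartbeats(c.node, nowMs, intervalMs) >= y) {
                toKill.add(c);
            }
        }
        return toKill;
    }
}
```

Keeping X and Y separate lets the scheduler stop placing new containers on a 
suspect node early (small X) while waiting longer (larger Y) before killing an 
AM container that has already been allocated there.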



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
