[jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures

Wangda Tan (JIRA) Thu, 25 Jan 2018 06:25:28 -0800

     [ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wangda Tan updated YARN-7790:
-----------------------------
    Attachment: YARN-7790.003.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Sumana Sathish
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-7790.001.patch, YARN-7790.002.patch, 
> YARN-7790.003.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, when an AM container is allocated to a node, the scheduler 
> assumes the node has just heartbeated to the RM, and the AM launcher connects 
> to the NM to launch the container. It is still possible for the NM to crash 
> after the heartbeat, which causes the AM to hang for a while, but this is 
> relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a 
> problematic NM, which can easily cause applications to hang. After discussing 
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
>  - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by tuning the following 
> configs, so that a container being launched at an NM with connectivity issues 
> fails earlier.
> {code:java}
> RetryPolicy retryPolicy =
>     createRetryPolicy(conf,
>       YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>       YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>       YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>       YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.
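
The node-skipping rule in the description can be illustrated with a small, self-contained sketch. This is not the code in the attached patch: the class name HeartbeatFilter, its methods, and the use of plain millisecond timestamps are all assumptions made for illustration. The idea is only that a node whose last heartbeat is older than X heartbeat intervals is left out of the candidate set for async allocation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of "skip nodes that missed X heartbeats".
 * Not taken from YARN-7790.003.patch; all names here are illustrative.
 */
final class HeartbeatFilter {
    private final long heartbeatIntervalMs;   // expected NM heartbeat interval
    private final int maxMissedHeartbeats;    // the "X" from the description

    HeartbeatFilter(long heartbeatIntervalMs, int maxMissedHeartbeats) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    /** True if the node has been silent for more than X heartbeat intervals. */
    boolean shouldSkip(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    /** Returns the ids of nodes that are still eligible for allocation. */
    List<String> schedulableNodes(Map<String, Long> lastHeartbeatByNode, long nowMs) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatByNode.entrySet()) {
            if (!shouldSkip(entry.getValue(), nowMs)) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }
}
```

In a real scheduler the timestamps would more likely come from a monotonic clock than from wall-clock time, so that clock adjustments on the RM host do not make healthy nodes look stale.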



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
