[
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339273#comment-16339273
]
Wangda Tan commented on YARN-7790:
----------------------------------
[~sunilg], my bad, just uploaded ver.3 patch.
> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
> Key: YARN-7790
> URL: https://issues.apache.org/jira/browse/YARN-7790
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sumana Sathish
> Assignee: Wangda Tan
> Priority: Critical
> Attachments: YARN-7790.001.patch, YARN-7790.002.patch,
> YARN-7790.003.patch
>
>
> This is not a new issue, but async scheduling makes it worse.
> In sync scheduling, when an AM container is allocated to a node, it can be
> assumed that the node has just sent a heartbeat to the RM, and the AM launcher
> will then connect to the NM to launch the container. It is still possible for
> the NM to crash right after the heartbeat, which causes the AM to hang for a
> while, but this is relatively rare.
> In async scheduling, multiple AM containers can be placed on a problematic
> NM, which can easily cause applications to hang. After discussing with
> [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
> - Skip any node that has missed X node heartbeats.
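A minimal sketch of the "skip nodes that missed X heartbeats" check described above, in plain Java with no Hadoop dependencies. The names here (`StaleNodeFilter`, `NodeInfo`, `shouldSkip`, `schedulableNodes`) are hypothetical and are not taken from the patch; the sketch also assumes the scheduler can see each node's last heartbeat timestamp and the expected heartbeat interval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleNodeFilter {

    /** Hypothetical view of a node: its id and when it last sent a heartbeat. */
    public static final class NodeInfo {
        public final String nodeId;
        public final long lastHeartbeatMs;

        public NodeInfo(String nodeId, long lastHeartbeatMs) {
            this.nodeId = nodeId;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Returns true when more than maxMissedHeartbeats heartbeat intervals
     * have elapsed since the node's last heartbeat.
     */
    public static boolean shouldSkip(NodeInfo node, long nowMs,
                                     long heartbeatIntervalMs,
                                     int maxMissedHeartbeats) {
        long elapsedMs = nowMs - node.lastHeartbeatMs;
        return elapsedMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    /** Keeps only the ids of nodes that are still considered responsive. */
    public static List<String> schedulableNodes(List<NodeInfo> nodes, long nowMs,
                                                long heartbeatIntervalMs,
                                                int maxMissedHeartbeats) {
        List<String> result = new ArrayList<>();
        for (NodeInfo node : nodes) {
            if (!shouldSkip(node, nowMs, heartbeatIntervalMs, maxMissedHeartbeats)) {
                result.add(node.nodeId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long now = 10_000L;
        List<NodeInfo> nodes = Arrays.asList(
            new NodeInfo("nm-1", 9_500L),   // heartbeat 500 ms ago
            new NodeInfo("nm-2", 7_000L));  // heartbeat 3000 ms ago
        // Illustrative values: 1000 ms heartbeat interval, X = 2.
        System.out.println(schedulableNodes(nodes, now, 1_000L, 2));
    }
}
```

With these illustrative values, a node whose last heartbeat is older than two intervals is filtered out before the async scheduler considers it for an AM container, while a node that has missed exactly X intervals is still kept.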
> In addition, it is better to reduce the wait time by setting the following
> configs, so that a container being launched on an NM with connectivity issues
> fails earlier.
> {code:java}
> RetryPolicy retryPolicy =
>     createRetryPolicy(conf,
>         YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>         YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>         YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>         YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.
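To illustrate why lowering the two settings above makes a launch fail earlier, here is a rough sketch of the retry budget. It assumes a policy that sleeps a fixed interval between attempts and gives up once the maximum wait is used up; whether `createRetryPolicy` behaves exactly this way is an assumption, and `NmConnectRetryBudget` and `maxRetries` are hypothetical names. The numbers are illustrative values only, not a statement about the actual defaults.

```java
public class NmConnectRetryBudget {

    /**
     * Number of retries that fit into maxWaitMs when each retry is
     * preceded by a fixed sleep of retryIntervalMs.
     */
    public static long maxRetries(long maxWaitMs, long retryIntervalMs) {
        if (maxWaitMs < 0 || retryIntervalMs <= 0) {
            throw new IllegalArgumentException(
                "maxWaitMs must be >= 0 and retryIntervalMs must be > 0");
        }
        return maxWaitMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        // Illustrative "long wait" setting: 180 s total, 10 s between attempts.
        System.out.println("long wait:  " + maxRetries(180_000L, 10_000L) + " retries");
        // Illustrative "short wait" setting: 15 s total, 5 s between attempts.
        System.out.println("short wait: " + maxRetries(15_000L, 5_000L) + " retries");
    }
}
```

Under this assumption, the AM launcher stops waiting on an unreachable NM after roughly the maximum wait, so a smaller maximum wait directly shortens how long an application can hang on a bad node.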