[ 
https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343519#comment-16343519
 ] 

Hudson commented on YARN-7790:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13575 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13575/])
YARN-7790. Improve Capacity Scheduler Async Scheduling to better handle 
(sunilg: rev e9c72d04beddfe0252d2e81123a9fe66bdf04078)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHAForAsyncScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerAsyncScheduling.java


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Sumana Sathish
>            Assignee: Wangda Tan
>            Priority: Critical
>             Fix For: 3.1.0, 3.0.1
>
>         Attachments: YARN-7790.001.patch, YARN-7790.002.patch, 
> YARN-7790.003.patch
>
>
> This is not a new issue, but async scheduling makes it worse:
> In sync scheduling, if an AM container is allocated to a node, the scheduler
> assumes the node has just heartbeated to the RM, and the AM launcher will
> connect to the NM to launch the container. It is still possible that the NM
> crashes after the heartbeat, which causes the AM to hang for a while, but
> this is relatively rare.
> In the async scheduling world, multiple AM containers can be placed on a
> problematic NM, which can easily cause applications to hang. After discussing
> with [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
>  - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by lowering the following
> configs, so that a container being launched at an NM with connectivity issues
> fails earlier.
> {code:java}
> RetryPolicy retryPolicy =
>     createRetryPolicy(conf,
>       YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>       YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>       YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>       YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.
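
The "skip a node that has missed X heartbeats" rule from the description can be illustrated with a small standalone sketch. This is not the code from the patch: the class `StaleNodeFilter`, the `NodeInfo` record, and the parameters `heartbeatIntervalMs` and `maxMissedHeartbeats` are hypothetical names chosen for illustration, and the real scheduler classes are deliberately not used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleNodeFilter {

    /** Hypothetical node record: an id plus the time of its last heartbeat. */
    static final class NodeInfo {
        final String id;
        final long lastHeartbeatMs;

        NodeInfo(String id, long lastHeartbeatMs) {
            this.id = id;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Returns true when the time since the last heartbeat exceeds
     * maxMissedHeartbeats heartbeat intervals, i.e. the node should
     * not receive new allocations in this scheduling pass.
     */
    static boolean shouldSkip(long nowMs, long lastHeartbeatMs,
                              long heartbeatIntervalMs, int maxMissedHeartbeats) {
        long elapsedMs = nowMs - lastHeartbeatMs;
        return elapsedMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    /** Keeps only the ids of nodes that have heartbeated recently enough. */
    static List<String> schedulableNodes(List<NodeInfo> nodes, long nowMs,
                                         long heartbeatIntervalMs, int maxMissedHeartbeats) {
        List<String> result = new ArrayList<>();
        for (NodeInfo node : nodes) {
            if (!shouldSkip(nowMs, node.lastHeartbeatMs,
                            heartbeatIntervalMs, maxMissedHeartbeats)) {
                result.add(node.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long nowMs = 10_000L;
        List<NodeInfo> nodes = Arrays.asList(
            new NodeInfo("nm-1", 9_500L),   // heartbeated 500 ms ago
            new NodeInfo("nm-2", 1_000L));  // silent for 9 s
        // Illustrative values: 1 s heartbeat interval, tolerate 2 missed heartbeats.
        System.out.println(schedulableNodes(nodes, nowMs, 1_000L, 2));
    }
}
```

In a real scheduler the timestamps would presumably come from a monotonic clock rather than wall-clock time, so that clock adjustments do not make healthy nodes look stale.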

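For the second part (failing earlier when an NM is unreachable), a rough way to reason about the two settings is to compute how many connection attempts fit into the wait budget. The sketch below assumes a fixed-sleep retry policy bounded by a maximum total wait, which is what the `createRetryPolicy` call in the description suggests; the class name `NmConnectRetryBudget` and the example values are hypothetical, and the property names behind the two constants should be checked against `yarn-default.xml` for the release in use.

```java
public class NmConnectRetryBudget {

    /**
     * Number of retries that fit into the wait budget, assuming a fixed
     * sleep between attempts (integer division, remainder discarded).
     */
    static long maxRetries(long maxWaitMs, long retryIntervalMs) {
        if (maxWaitMs < 0 || retryIntervalMs <= 0) {
            throw new IllegalArgumentException(
                "maxWaitMs must be >= 0 and retryIntervalMs must be > 0");
        }
        return maxWaitMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        // Illustrative values only, not recommended settings.
        long maxWaitMs = 30_000L;      // stands in for CLIENT_NM_CONNECT_MAX_WAIT_MS
        long retryIntervalMs = 5_000L; // stands in for CLIENT_NM_CONNECT_RETRY_INTERVAL_MS
        System.out.println("retries before giving up: "
            + maxRetries(maxWaitMs, retryIntervalMs));
    }
}
```

Lowering the maximum wait shortens the time an AM launch can hang on a dead NM, at the cost of giving up sooner on an NM that is only briefly unreachable.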


