[ https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wangda Tan updated YARN-7790:
-----------------------------
    Attachment: YARN-7790.003.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Sumana Sathish
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-7790.001.patch, YARN-7790.002.patch,
> YARN-7790.003.patch
>
>
> This is not a new issue, but async scheduling makes it worse.
> In sync scheduling, when an AM container is allocated to a node, the
> scheduler can assume that the node has just sent a heartbeat to the RM, and
> the AM launcher then connects to the NM to launch the container. It is still
> possible for the NM to crash right after the heartbeat, which causes the AM
> to hang for a while, but this is relatively rare.
> With async scheduling, multiple AM containers can be placed on a problematic
> NM, which can easily cause applications to hang. After discussing with
> [~sunilg] and [~jianhe], we need one fix:
> When async scheduling is enabled:
> - Skip any node that has missed X node heartbeats.
> In addition, it is better to reduce the wait time by tuning the following
> configs, so that a container being launched on an NM with connectivity issues
> fails earlier.
> {code:java}
> RetryPolicy retryPolicy =
>     createRetryPolicy(conf,
>         YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
>         YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
>         YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
>         YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
> {code}
> The second part is not covered by the patch.
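
The "skip node which missed X node heartbeat" rule from the description could be sketched roughly as follows. This is only an illustration of the idea, not the code in YARN-7790.003.patch: the class `StaleNodeFilter`, its methods, and the use of a plain map of node id to last-heartbeat timestamp are all hypothetical, and it assumes the scheduler knows the expected heartbeat interval and the time of each node's last heartbeat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of YARN): decides whether an async
 * scheduling thread should skip a node because it has been silent
 * for more than X expected heartbeat intervals.
 */
class StaleNodeFilter {
    private final long heartbeatIntervalMs;   // assumed NM -> RM heartbeat period
    private final int maxMissedHeartbeats;    // the "X" from the issue description

    StaleNodeFilter(long heartbeatIntervalMs, int maxMissedHeartbeats) {
        if (heartbeatIntervalMs <= 0 || maxMissedHeartbeats < 1) {
            throw new IllegalArgumentException("interval and X must be positive");
        }
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    /** True if the node has been silent for longer than X heartbeat intervals. */
    boolean shouldSkip(long lastHeartbeatMs, long nowMs) {
        long silentForMs = nowMs - lastHeartbeatMs;
        return silentForMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    /** Returns the ids of nodes that are still fresh enough to allocate on. */
    List<String> schedulableNodes(Map<String, Long> lastHeartbeatByNode, long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatByNode.entrySet()) {
            if (!shouldSkip(e.getValue(), nowMs)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With a 1000 ms interval and X = 2, a node last heard from 500 ms ago stays schedulable, while one silent for 5000 ms is skipped, so new AM containers are not placed on it until it heartbeats again.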