[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087117#comment-14087117 ]
Hadoop QA commented on YARN-2359:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12660000/YARN-2359.002.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4526//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4526//console

This message is automatically generated.

> Application is hung without timeout and retry after DNS/network is down.
> -------------------------------------------------------------------------
>
>                 Key: YARN-2359
>                 URL: https://issues.apache.org/jira/browse/YARN-2359
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2359.000.patch, YARN-2359.001.patch,
> YARN-2359.002.patch
>
>
> Application is hung without timeout and retry after DNS/network is down.
> This happens when the DNS/network goes down for the node hosting the AM
> container right after the container is allocated for the AM.
> The application attempt is in state RMAppAttemptState.SCHEDULED when it
> receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an
> IllegalArgumentException (due to the DNS error) is thrown, the attempt stays
> in state RMAppAttemptState.SCHEDULED. In the state machine, only two events
> are processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event,
> which is generated when the node and container time out. So even after the
> node is removed, the application remains hung in state
> RMAppAttemptState.SCHEDULED.
> The only way to make the application leave this state is to send an
> RMAppAttemptEventType.KILL event, which is only generated when you manually
> kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table to handle
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state
> RMAppAttemptState.SCHEDULED.
> Add the following code in StateMachineFactory:
> {code}.addTransition(RMAppAttemptState.SCHEDULED,
>           RMAppAttemptState.FINAL_SAVING,
>           RMAppAttemptEventType.CONTAINER_FINISHED,
>           new FinalSavingTransition(
>             new AMContainerCrashedBeforeRunningTransition(),
>             RMAppAttemptState.FAILED)){code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
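
The stuck-state behavior described in the issue can be illustrated with a small table-driven state machine. This is only a rough sketch and not the YARN StateMachineFactory API: the class ScheduledStateSketch, its enums, and its handle method are hypothetical names, the target states for CONTAINER_ALLOCATED and KILL are simplified placeholders, and the sketch assumes that an event with no registered transition simply leaves the state unchanged.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: NOT the YARN StateMachineFactory API. All names are hypothetical.
public class ScheduledStateSketch {
    enum State { SCHEDULED, ALLOCATED, FINAL_SAVING }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.SCHEDULED;

    ScheduledStateSketch(boolean withFix) {
        // The two events the issue says are handled in SCHEDULED
        // (target states here are simplified placeholders).
        addTransition(State.SCHEDULED, State.ALLOCATED, Event.CONTAINER_ALLOCATED);
        addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.KILL);
        if (withFix) {
            // The extra entry proposed in the issue.
            addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
        }
    }

    private void addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    // Applies the event if a transition is registered; otherwise the state is unchanged.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        if (row != null && row.containsKey(event)) {
            current = row.get(event);
        }
        return current;
    }

    public static void main(String[] args) {
        // Without the extra entry, CONTAINER_FINISHED is ignored in SCHEDULED.
        System.out.println(new ScheduledStateSketch(false).handle(Event.CONTAINER_FINISHED)); // SCHEDULED
        // With the extra entry, the attempt leaves SCHEDULED.
        System.out.println(new ScheduledStateSketch(true).handle(Event.CONTAINER_FINISHED));  // FINAL_SAVING
    }
}
```

In this toy model, the attempt only leaves SCHEDULED on CONTAINER_FINISHED once the extra table entry is registered, which mirrors the reasoning behind the proposed addTransition call.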