[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087760#comment-14087760 ]

Karthik Kambatla commented on YARN-2359:
----------------------------------------

+1. Will commit this later today if no one objects.

Application is hung without timeout and retry after DNS/network is down.
------------------------------------------------------------------------

Key: YARN-2359
URL: https://issues.apache.org/jira/browse/YARN-2359
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
Attachments: YARN-2359.000.patch, YARN-2359.001.patch, YARN-2359.002.patch

The application hangs without timeout or retry after the DNS/network goes down. This happens when the DNS/network goes down for the node that hosts the AM container right after the container is allocated for the AM.

The application attempt is in state RMAppAttemptState.SCHEDULED when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an IllegalArgumentException (due to the DNS error) is thrown, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application remains hung in state RMAppAttemptState.SCHEDULED. The only way to make the application leave this state is to send an RMAppAttemptEventType.KILL event, which is generated only when you manually kill the application from the Job Client by forceKillApplication.

To fix the issue, we should add an entry to the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED. Add the following code in StateMachineFactory:
{code}
.addTransition(RMAppAttemptState.SCHEDULED,
    RMAppAttemptState.FINAL_SAVING,
    RMAppAttemptEventType.CONTAINER_FINISHED,
    new FinalSavingTransition(
        new AMContainerCrashedBeforeRunningTransition(),
        RMAppAttemptState.FAILED))
{code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
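The hang and the proposed transition described in the issue can be illustrated with a toy table-driven state machine. This is a minimal sketch, not YARN's StateMachineFactory: the class ScheduledStateSketch, its handle method, and the reduced state set are assumptions made for illustration. CONTAINER_ALLOCATED is mapped back to SCHEDULED only to model the DNS-failure case in which no container is returned, and KILL goes straight to KILLED for brevity.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy table-driven state machine; an illustration only, not YARN's StateMachineFactory.
class ScheduledStateSketch {
  enum State { SCHEDULED, FINAL_SAVING, KILLED }
  enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

  // Transition table: current state -> (event -> next state).
  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.SCHEDULED;

  ScheduledStateSketch addTransition(State from, State to, Event on) {
    table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    return this;
  }

  // An event with no table entry for the current state leaves the state unchanged.
  State handle(Event event) {
    Map<Event, State> row = table.get(current);
    State next = (row == null) ? null : row.get(event);
    if (next != null) {
      current = next;
    }
    return current;
  }

  // Table as described in the issue: only CONTAINER_ALLOCATED and KILL are handled at SCHEDULED.
  static ScheduledStateSketch beforeFix() {
    return new ScheduledStateSketch()
        // Assumption: models the DNS-failure case, where no container comes back.
        .addTransition(State.SCHEDULED, State.SCHEDULED, Event.CONTAINER_ALLOCATED)
        .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
  }

  // Same table plus the proposed entry for CONTAINER_FINISHED at SCHEDULED.
  static ScheduledStateSketch afterFix() {
    return beforeFix()
        .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
  }

  public static void main(String[] args) {
    ScheduledStateSketch before = beforeFix();
    before.handle(Event.CONTAINER_ALLOCATED);
    // No entry for CONTAINER_FINISHED: the attempt stays in SCHEDULED.
    System.out.println(before.handle(Event.CONTAINER_FINISHED)); // prints SCHEDULED

    ScheduledStateSketch after = afterFix();
    after.handle(Event.CONTAINER_ALLOCATED);
    // With the added entry the attempt leaves SCHEDULED.
    System.out.println(after.handle(Event.CONTAINER_FINISHED)); // prints FINAL_SAVING
  }
}
```

In the sketch, as in the report, the only exit from SCHEDULED before the fix is the KILL event.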
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087848#comment-14087848 ]

Tsuyoshi OZAWA commented on YARN-2359:
--------------------------------------

+1(non-binding), it looks good to me. Also ran tests and confirmed that it works.
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088002#comment-14088002 ]

Jian He commented on YARN-2359:
-------------------------------

[~zxu], thanks for working on it. I have a question:

bq. The application attempt is at state RMAppAttemptState.SCHEDULED, it receives RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the IllegalArgumentException (due to DNS error) happened, it stays at state RMAppAttemptState.SCHEDULED.

Where in the code is the IllegalArgumentException thrown?
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088047#comment-14088047 ]

zhihai xu commented on YARN-2359:
---------------------------------

[~jianhe] The code is in pullNewlyAllocatedContainersAndNMTokens of SchedulerApplicationAttempt.java:
{code}
try {
  // create container token and NMToken altogether.
  container.setContainerToken(rmContext.getContainerTokenSecretManager()
      .createContainerToken(container.getId(), container.getNodeId(),
          getUser(), container.getResource(), container.getPriority(),
          rmContainer.getCreationTime()));
  NMToken nmToken = rmContext.getNMTokenSecretManager()
      .createAndGetNMToken(getUser(), getApplicationAttemptId(), container);
  if (nmToken != null) {
    nmTokens.add(nmToken);
  }
} catch (IllegalArgumentException e) {
  // DNS might be down, skip returning this container.
  LOG.error("Error trying to assign container token and NM token to"
      + " an allocated container " + container.getId(), e);
  continue;
}
{code}
When createContainerToken throws an IllegalArgumentException, the code skips the container, so zero containers are returned in amContainerAllocation. The following code in AMContainerAllocatedTransition in RMAppAttemptImpl.java then keeps retrying CONTAINER_ALLOCATED in the SCHEDULED state. So the IllegalArgumentException causes zero containers to be returned in amContainerAllocation, which in turn keeps RMAppAttemptImpl in state RMAppAttemptState.SCHEDULED.
{code}
if (amContainerAllocation.getContainers().size() == 0) {
  appAttempt.retryFetchingAMContainer(appAttempt);
  return RMAppAttemptState.SCHEDULED;
}
{code}
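The interaction between the skip-on-error loop and the retry in the SCHEDULED state can be sketched in isolation. This is a minimal stand-alone model, not the ResourceManager code: the class AmContainerRetrySketch, the use of strings for containers, the tokenFactory parameter, and the maxAttempts bound are all assumptions made so the example terminates. The behavior reported in this issue has no such bound, which is why the attempt hangs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Stand-alone model of the skip-and-retry behavior; names here are hypothetical.
class AmContainerRetrySketch {

  // Mirrors the skip-on-error loop: a container whose token cannot be
  // created is dropped from the returned list.
  static List<String> pullContainers(List<String> allocated,
                                     Function<String, String> tokenFactory) {
    List<String> returned = new ArrayList<>();
    for (String container : allocated) {
      try {
        tokenFactory.apply(container);
      } catch (IllegalArgumentException e) {
        // DNS might be down, skip returning this container.
        continue;
      }
      returned.add(container);
    }
    return returned;
  }

  // Returns the attempt number on which a container was obtained, or -1 if
  // every attempt came back empty. maxAttempts exists only in this sketch.
  static int attemptsUntilAllocated(List<String> allocated,
                                    Function<String, String> tokenFactory,
                                    int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (!pullContainers(allocated, tokenFactory).isEmpty()) {
        return attempt;
      }
      // Zero containers: the modeled attempt retries and stays in SCHEDULED.
    }
    return -1;
  }

  public static void main(String[] args) {
    Function<String, String> dnsDown = c -> {
      throw new IllegalArgumentException("cannot resolve host for " + c);
    };
    Function<String, String> dnsUp = c -> "token-for-" + c;
    List<String> allocated = Collections.singletonList("container_01");

    System.out.println(attemptsUntilAllocated(allocated, dnsDown, 5)); // prints -1
    System.out.println(attemptsUntilAllocated(allocated, dnsUp, 5));   // prints 1
  }
}
```

While token creation keeps failing, every retry returns an empty list, so nothing in this loop alone can move the attempt out of SCHEDULED.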
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088161#comment-14088161 ]

Jian He commented on YARN-2359:
-------------------------------

I see, thanks for your explanation. Looks good to me too.
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088479#comment-14088479 ]

Karthik Kambatla commented on YARN-2359:
----------------------------------------

Checking this in..
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086985#comment-14086985 ]

zhihai xu commented on YARN-2359:
---------------------------------

Uploaded a new patch that adds a comment to the unit test.
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087117#comment-14087117 ]

Hadoop QA commented on YARN-2359:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/1266/YARN-2359.002.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4526//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4526//console

This message is automatically generated.
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075531#comment-14075531 ]

zhihai xu commented on YARN-2359:
---------------------------------

I just added a unit test case (testAMCrashAtScheduled) to the patch to verify this state transition in the RMAppAttempt state machine.
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075537#comment-14075537 ]

Hadoop QA commented on YARN-2359:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12658009/YARN-2359.001.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4448//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4448//console

This message is automatically generated.
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074873#comment-14074873 ]

Hadoop QA commented on YARN-2359:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12657887/YARN-2359.000.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

                  org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4436//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4436//console

This message is automatically generated.