[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051339#comment-14051339 ]

Hudson commented on YARN-2065:

SUCCESS: Integrated in Hadoop-Yarn-trunk #602 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/602/])
YARN-2065 AM cannot create new containers after restart (stevel: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607441)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java

AM cannot create new containers after restart-NM token from previous attempt used

Key: YARN-2065
URL: https://issues.apache.org/jira/browse/YARN-2065
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Steve Loughran
Assignee: Jian He
Fix For: 2.5.0
Attachments: YARN-2065-002.patch, YARN-2065-003.patch, YARN-2065.1.patch

Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers.
The Slider minicluster test {{TestKilledAM}} can replicate this reliably - it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051472#comment-14051472 ]

Hudson commented on YARN-2065:

FAILURE: Integrated in Hadoop-Hdfs-trunk #1793 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1793/])
YARN-2065 AM cannot create new containers after restart (stevel: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607441)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051567#comment-14051567 ]

Hudson commented on YARN-2065:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1820 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1820/])
YARN-2065 AM cannot create new containers after restart (stevel: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607441)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050272#comment-14050272 ]

Hadoop QA commented on YARN-2065:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653606/YARN-2065-003.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4178//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4178//console

This message is automatically generated.
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050286#comment-14050286 ]

Jian He commented on YARN-2065:

Thanks for the testing, Steve!
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050521#comment-14050521 ]

Steve Loughran commented on YARN-2065:

With Jenkins happy, I'm +1 on this patch; it fixes what it says it does.
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050562#comment-14050562 ]

Hudson commented on YARN-2065:

SUCCESS: Integrated in Hadoop-trunk-Commit #5808 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5808/])
YARN-2065 AM cannot create new containers after restart (stevel: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1607441)
* /hadoop/common/trunk
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047111#comment-14047111 ]

Steve Loughran commented on YARN-2065:

I'll try to run my code against this patch this week.
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047171#comment-14047171 ]

Hadoop QA commented on YARN-2065:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653066/YARN-2065-002.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:red}-1 javac{color}. The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4134//console

This message is automatically generated.
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008045#comment-14008045 ]

Jian He commented on YARN-2065:

Also changed authorizeGetAndStopContainerRequest to check against appId.

bq. The token is generated with the previous container's attempt Id, instead of the current attemptId.

This actually should not be a problem after changing the two methods to check against appId instead of attemptId.
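The application-level check described in this comment can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `AttemptId` and `sameApplication` are hypothetical stand-ins invented for illustration, not the YARN classes or the code in the committed patch.

```java
import java.util.Objects;

/** Simplified model of an application-level check; all names here are hypothetical. */
final class AppLevelTokenCheckSketch {

    /** Stand-in for an application attempt ID: an application ID plus an attempt number. */
    static final class AttemptId {
        final String appId;
        final int attemptNumber;

        AttemptId(String appId, int attemptNumber) {
            this.appId = Objects.requireNonNull(appId);
            this.attemptNumber = attemptNumber;
        }
    }

    /**
     * Accepts a request when the token and the container belong to the same
     * application, regardless of which attempt each one was issued for.
     */
    static boolean sameApplication(AttemptId tokenAttempt, AttemptId containerAttempt) {
        return tokenAttempt.appId.equals(containerAttempt.appId);
    }

    public static void main(String[] args) {
        AttemptId previous = new AttemptId("application_1", 1);
        AttemptId current = new AttemptId("application_1", 2);
        AttemptId otherApp = new AttemptId("application_2", 1);

        // A token from the previous attempt is accepted for a container of the current attempt.
        System.out.println(sameApplication(previous, current));  // prints "true"
        // A token from a different application is still rejected.
        System.out.println(sameApplication(previous, otherApp)); // prints "false"
    }
}
```

Comparing only the application part means a token issued to either attempt of the same application is accepted, while tokens from other applications are still refused.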
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999094#comment-13999094 ]

Jian He commented on YARN-2065:

I looked at the exception posted in SLIDER-34. The problem is that the AM can get new containers from the RM, but cannot launch them on the NM because of the following method. The token is generated with the previous container's attempt Id, instead of the current attemptId, and the NM checks the attemptId from the NMToken against the attemptId from the container.

{code}
public NMToken createAndGetNMToken(String applicationSubmitter,
    ApplicationAttemptId appAttemptId, Container container) {
  try {
    this.readLock.lock();
    HashSet<NodeId> nodeSet = this.appAttemptToNodeKeyMap.get(appAttemptId);
    NMToken nmToken = null;
    if (nodeSet != null) {
      if (!nodeSet.contains(container.getNodeId())) {
        LOG.info("Sending NMToken for nodeId : " + container.getNodeId()
            + " for container : " + container.getId());
        // Note: the attempt ID below comes from the container, not from appAttemptId.
        Token token = createNMToken(container.getId().getApplicationAttemptId(),
            container.getNodeId(), applicationSubmitter);
        nmToken = NMToken.newInstance(container.getNodeId(), token);
        nodeSet.add(container.getNodeId());
      }
    }
    return nmToken;
  } finally {
    this.readLock.unlock();
  }
}
{code}

Changing this method will fix this problem. But another problem is that ContainerManagerImpl#authorizeGetAndStopContainerRequest also requires the previous NMToken to talk to the previous container and the current NMToken to talk to the current container. Luckily, it currently does not throw an exception but only logs error messages. We also need to change the NM side to check against the applicationId rather than the attemptId.
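The mismatch described in this comment can be modelled in a few lines. This is a minimal sketch, assuming the NM-side check is a plain equality test between the attempt ID carried in the token and the attempt ID of the container; `AttemptId` and `sameAttempt` are hypothetical names for illustration, not YARN APIs.

```java
import java.util.Objects;

/** Simplified model of an attempt-level check; all names here are hypothetical. */
final class AttemptLevelTokenCheckSketch {

    /** Stand-in for an application attempt ID: an application ID plus an attempt number. */
    static final class AttemptId {
        final String appId;
        final int attemptNumber;

        AttemptId(String appId, int attemptNumber) {
            this.appId = Objects.requireNonNull(appId);
            this.attemptNumber = attemptNumber;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof AttemptId)) {
                return false;
            }
            AttemptId other = (AttemptId) o;
            return appId.equals(other.appId) && attemptNumber == other.attemptNumber;
        }

        @Override
        public int hashCode() {
            return Objects.hash(appId, attemptNumber);
        }
    }

    /** Accepts a request only when token and container carry exactly the same attempt ID. */
    static boolean sameAttempt(AttemptId tokenAttempt, AttemptId containerAttempt) {
        return tokenAttempt.equals(containerAttempt);
    }

    public static void main(String[] args) {
        AttemptId tokenFromPreviousAttempt = new AttemptId("application_1", 1);
        AttemptId containerOfCurrentAttempt = new AttemptId("application_1", 2);

        // Same application, different attempts: the strict check refuses the request.
        System.out.println(sameAttempt(tokenFromPreviousAttempt, containerOfCurrentAttempt)); // prints "false"
    }
}
```

Under this model, any token that outlives the attempt it was issued for is rejected, even though it belongs to the same application.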