[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043033#comment-14043033 ]

Jian He commented on YARN-1365:
---
looks good, committing

ApplicationMasterService to allow Register and Unregister of an app that was running before restart
---
Key: YARN-1365
URL: https://issues.apache.org/jira/browse/YARN-1365
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Bikas Saha
Assignee: Anubhav Dhoot
Attachments: YARN-1365.001.patch, YARN-1365.002.patch, YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, YARN-1365.005.patch, YARN-1365.006.patch, YARN-1365.007.patch, YARN-1365.008.patch, YARN-1365.008.patch, YARN-1365.009.patch, YARN-1365.initial.patch

For an application that was running before restart, the ApplicationMasterService currently throws an exception when the app tries to make the initial register or final unregister call. These should succeed and the RMApp state machine should transition to completed like normal. Unregistration should succeed for an app that the RM considers complete since the RM may have died after saving completion in the store but before notifying the AM that the AM is free to exit.

--
This message was sent by Atlassian JIRA (v6.2#6252)
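The unregister requirement in the issue description can be illustrated with a small, self-contained sketch. Everything here is hypothetical and simplified: `AppState`, `RecoveredApp` and this `finishApplicationMaster` are stand-ins written for illustration, not the real RMApp state machine or ApplicationMasterService code. The point is only that an unregister call for an app already recorded as complete is acknowledged rather than rejected.

```java
public class UnregisterSketch {
    /** Hypothetical, simplified app states; the real state machine has many more. */
    enum AppState { RUNNING, FINISHED }

    /** Hypothetical stand-in for the RM's view of one recovered application. */
    static class RecoveredApp {
        AppState state;
        int transitions = 0; // counts state changes made by unregister calls

        RecoveredApp(AppState state) { this.state = state; }
    }

    /**
     * Handles the AM's final unregister call and returns true when the AM may exit.
     * If completion was already saved before the restart, the call is acknowledged
     * without another state change instead of throwing.
     */
    static boolean finishApplicationMaster(RecoveredApp app) {
        if (app.state == AppState.FINISHED) {
            return true; // RM already considers the app complete
        }
        app.state = AppState.FINISHED;
        app.transitions++;
        return true;
    }

    public static void main(String[] args) {
        RecoveredApp running = new RecoveredApp(AppState.RUNNING);
        RecoveredApp alreadyDone = new RecoveredApp(AppState.FINISHED);
        System.out.println(finishApplicationMaster(running) + " " + running.transitions);
        System.out.println(finishApplicationMaster(alreadyDone) + " " + alreadyDone.transitions);
    }
}
```

In this sketch a repeated unregister is harmless, which is the property the description asks for when the RM died between saving completion and notifying the AM.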
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043039#comment-14043039 ]

Jian He commented on YARN-1365:
---
Opened YARN-2209 to use exceptions in the allocate call, instead of the AM resync command, to tell the AM to re-register.

Fix For: 2.5.0
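A minimal sketch of the AM-side behaviour this implies: when a call fails because the RM no longer knows the AM is registered, the AM re-registers and retries. `FakeRm` and `NotRegisteredException` are hypothetical stand-ins invented for this example (the exception is unchecked only to keep the sketch short); they are not the real ApplicationMasterService or ApplicationMasterNotRegisteredException classes.

```java
import java.util.ArrayList;
import java.util.List;

public class ReRegisterSketch {
    /** Hypothetical stand-in for the "AM not registered" exception. */
    static class NotRegisteredException extends RuntimeException {
        NotRegisteredException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for the RM side; records the calls it receives. */
    static class FakeRm {
        private boolean registered = false;
        final List<String> calls = new ArrayList<>();

        void register() { registered = true; calls.add("register"); }

        String allocate() {
            calls.add("allocate");
            if (!registered) {
                throw new NotRegisteredException("AM not registered; RM may have restarted");
            }
            return "ok";
        }

        /** Simulates an RM restart that forgets in-memory registrations. */
        void restart() { registered = false; calls.add("restart"); }
    }

    /** AM-side heartbeat: on the not-registered exception, re-register once and retry. */
    static String allocateWithResync(FakeRm rm) {
        try {
            return rm.allocate();
        } catch (NotRegisteredException e) {
            rm.register();
            return rm.allocate();
        }
    }

    public static void main(String[] args) {
        FakeRm rm = new FakeRm();
        rm.register();
        rm.restart();
        System.out.println(allocateWithResync(rm));
        System.out.println(rm.calls);
    }
}
```

The retry happens exactly once here; a real client would likely bound retries and resend outstanding requests after re-registering, which this sketch does not attempt.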
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043042#comment-14043042 ]

Hudson commented on YARN-1365:
--
SUCCESS: Integrated in Hadoop-trunk-Commit #5777 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5777/])
YARN-1365. Changed ApplicationMasterService to allow an app to re-register after RM restart. Contributed by Anubhav Dhoot (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1605263)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ApplicationMasterNotRegisteredException.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidApplicationMasterRequestException.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/AppAttemptAddedSchedulerEvent.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterLauncher.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041267#comment-14041267 ]

Hadoop QA commented on YARN-1365:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12652034/YARN-1365.008.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4054//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4054//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041355#comment-14041355 ]

Anubhav Dhoot commented on YARN-1365:
-
The changes to addApplication caused the failures. As per Jian's suggestion, I am going to open a separate JIRA to fix that and undo those changes here.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041433#comment-14041433 ]

Hadoop QA commented on YARN-1365:
-
{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12652074/YARN-1365.008.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4056//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4056//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041506#comment-14041506 ]

Jian He commented on YARN-1365:
---
Can we also revert the RMAppImpl changes?
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041600#comment-14041600 ]

Hadoop QA commented on YARN-1365:
-
{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12652104/YARN-1365.009.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4057//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4057//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039222#comment-14039222 ]

Jian He commented on YARN-1365:
---
Looks good overall, just some minor comments:

The audit logger call below was removed. I think we should restore it and adjust it for resync, given that the finishApplicationMaster call also retains it.
{code}
-RMAuditLogger.logFailure(
-this.rmContext.getRMApps().get(appAttemptId.getApplicationId())
-.getUser(), AuditConstants.REGISTER_AM, "",
-"ApplicationMasterService", message,
-appAttemptId.getApplicationId(),
-appAttemptId);
{code}

The code comment in ApplicationMasterNotRegisteredException uses the fully qualified name "org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest", which is too long and hard to read. Can you change it to use RegisterApplicationMasterRequest instead? Thanks.

Rename: testRMRestartWorkPreservingAppReregister -> testAppReregisterOnRMWorkPreservingRestart
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037791#comment-14037791 ]

Jian He commented on YARN-1365:
---
Thanks for updating the patch.

1. How about renaming UnregisteredApplicationMasterException -> ApplicationMasterNotRegisteredException? Please also add a comment that this exception can happen even if the application has registered before, because the RM may have restarted, and that the expected way to handle this exception is to re-register.

2. This newly added constructor is not used anywhere? We can just use "app.handler.handle" to send the scheduler event in RMAppRecoveredTransition instead of refactoring the transition.
{code}
public void transition(RMAppImpl app, RMAppEvent event,
    boolean shouldSchedulerNotifyAppAdded) {
  transitionImplementation(app, event, shouldSchedulerNotifyAppAdded);
}
{code}

3. The following code in FifoScheduler can be consolidated to 2 lines.
{code}
public synchronized void addApplication(ApplicationId applicationId,
    String queue, String user,
    boolean shouldNotifyAppAccepted) {
{code}

4. Some minor comments on testRMRestartWorkPreservingAppReregister:

This conf.set is not needed; it's already enabled globally.
{code}
conf.setBoolean(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_ENABLED, true);
{code}

We can use MockRM.launchAndRegisterAM instead of changing TestRMRestart.launchAM to be static.
{code}
MockAM am0 = TestRMRestart.launchAM(app0, rm1, nm1);
{code}

If the global variables rm1 and rm2 are used, the following two statements are not needed.
{code}
rm1.stop();
rm2.stop();
{code}
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030874#comment-14030874 ]

Anubhav Dhoot commented on YARN-1365:
-
Hi [~jianhe],

I addressed all your comments except this one: "we can print the current state of RMAppAttempt also which will be useful for debugging".

There is no easy way to get to RMAppAttempt at that point, and I don't want to add a dependency on it just for logging. Let me know if you think there is an easy way to get to it.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030955#comment-14030955 ]

Hadoop QA commented on YARN-1365:
-
{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12650337/YARN-1365.005.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3981//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3981//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030979#comment-14030979 ]

Hadoop QA commented on YARN-1365:
-
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12650340/YARN-1365.005.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3983//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3983//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031248#comment-14031248 ] Jian He commented on YARN-1365: --- Thanks for updating the patch. The debug logging can be wrapped with isDebugEnabled condition {code} LOG.debug(Skipping notifying ATTEMPT_ADDED); {code} The following code is removed, but schedulers#addApplication are not handling the case to not send app_accepted events as we do for addApplicationAttempt. My point was we can do the same for both addApplication and addApplicationAttempt to not send dup events. Given this is not relevant to this patch itself, we can fix this separately if needed. {code} // ACCECPTED state can once again receive APP_ACCEPTED event, because on // recovery the app returns ACCEPTED state and the app once again go // through the scheduler and triggers one more APP_ACCEPTED event at // ACCEPTED state. .addTransition(RMAppState.ACCEPTE {code} This transition can never happen ? given that unregister also has to do resync. {code} .addTransition(RMAppAttemptState.LAUNCHED, EnumSet.of(RMAppAttemptState.FINAL_SAVING, RMAppAttemptState.FINISHED), RMAppAttemptEventType.UNREGISTERED, new AMUnregisteredTransition()) {code} This piece of code is not needed, the previous launchAM internally checks the app state already. We can use MockRM.launchAndRegisterAM alternatively. The test case can be moved to TestWorkPreservingRMRestart {code} nm1.nodeHeartbeat(am0.getApplicationAttemptId(), 1, ContainerState.RUNNING); am0.waitForState(RMAppAttemptState.RUNNING); rm1.waitForState(app0.getApplicationId(), RMAppState.RUNNING); {code} *Just thinking*: Does it make sense to map AMCommand(shutdown, resync) to corresponding exceptions? The benefit is that we don’t need to add extra fields in AMS protocol response and user not using AMRMClient will be forced to handle such condition to work with RM restart. thoughts? 
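The idea of mapping AM commands to exceptions, raised at the end of this comment, can be illustrated with a minimal sketch. All types here (`AmCommand`, `AmResyncException`, `AmShutdownException`, `CommandToException`) are hypothetical stand-ins invented for the illustration; they are not the real YARN classes, and the actual change may look quite different.

```java
// Hypothetical stand-ins; none of these types are the real YARN classes.
enum AmCommand { RESYNC, SHUTDOWN }

class AmResyncException extends Exception {
    AmResyncException(String msg) { super(msg); }
}

class AmShutdownException extends Exception {
    AmShutdownException(String msg) { super(msg); }
}

class CommandToException {
    /** Translates a command the RM would have returned into a thrown exception. */
    static void raise(AmCommand command, String attemptId)
            throws AmResyncException, AmShutdownException {
        if (command == null) {
            return; // nothing to signal; the call proceeds normally
        }
        switch (command) {
            case RESYNC:
                throw new AmResyncException("Attempt " + attemptId + " must re-register");
            case SHUTDOWN:
                throw new AmShutdownException("Attempt " + attemptId + " must shut down");
            default:
                throw new IllegalStateException("Unhandled command: " + command);
        }
    }

    /** What a hand-written client has to handle once the signal is a checked exception. */
    static String handle(AmCommand command, String attemptId) {
        try {
            raise(command, attemptId);
            return "continue";
        } catch (AmResyncException e) {
            return "re-register";
        } catch (AmShutdownException e) {
            return "exit";
        }
    }
}
```

Because both exceptions are checked, a caller that ignores them does not compile, which is the "forced to handle" property the comment is after; a command hidden in a response field carries no such guarantee.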
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14031268#comment-14031268 ] Anubhav Dhoot commented on YARN-1365: -
Agreed. I was trying to be consistent with AllocateResponse, but I would prefer exceptions. The AM client will discover the condition automatically instead of it being hidden in a return value. I would prefer that AllocateResponse also use exceptions instead of AM commands; I can open a JIRA for it. I will address your other comments as well.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028139#comment-14028139 ] Tsuyoshi OZAWA commented on YARN-1365: --
[~jianhe], [~adhoot], yes, +1 for keeping the semantics, for simplicity. The AM should re-issue the {{finishApplicationMaster()}} RPC when the RM returns Resync, so the behavior is consistent before and after the change.
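The client behavior described here (re-register on Resync, then re-issue the unregister call) can be sketched as follows. `FakeRm` and `ResyncingClient` are hypothetical names invented for this illustration, and the RM is reduced to an in-memory set; this is an assumption-laden sketch, not the AMRMClient implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the RM side; not the real ApplicationMasterService.
class FakeRm {
    enum Reply { OK, RESYNC }

    private final Set<String> registered = new HashSet<>();

    void register(String attemptId) { registered.add(attemptId); }

    /** Simulates an RM restart: in-memory registrations are lost. */
    void restart() { registered.clear(); }

    /** Accepts the unregister call only from an attempt that has registered. */
    Reply finish(String attemptId) {
        return registered.contains(attemptId) ? Reply.OK : Reply.RESYNC;
    }
}

class ResyncingClient {
    /** Returns the number of finish calls needed before the RM accepted one. */
    static int finishWithResync(FakeRm rm, String attemptId, int maxTries) {
        for (int tries = 1; tries <= maxTries; tries++) {
            if (rm.finish(attemptId) == FakeRm.Reply.OK) {
                return tries;
            }
            rm.register(attemptId); // resync: register again, then re-issue finish
        }
        throw new IllegalStateException("RM never accepted unregister for " + attemptId);
    }
}
```

Without a restart the first call succeeds; after a simulated restart the client needs one extra round trip, which is the cost of keeping the "register before unregister" contract intact.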
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027012#comment-14027012 ] Jian He commented on YARN-1365: ---
Anubhav, thanks for working on the patch. Some comments:

Can the following two pieces of code be merged into the same if block, like this?
{code}
// Need to register an app attempt before AM can register
appAttempt.masterService
    .registerAppAttempt(appAttempt.applicationAttemptId);
appAttempt.eventHandler.handle(new AppAttemptAddedSchedulerEvent(
    appAttempt.getAppAttemptId(), false, false));
{code}
We can also print the current state of the RMAppAttempt, which will be useful for debugging.
{code}
LOG.info("Skipping notifying ATTEMPT_ADDED");
{code}
We should also remove the following transition and, in the same way, pass a flag in AppAddedSchedulerEvent telling the scheduler not to re-send the APP_ACCEPTED event for recovered apps.
{code}
// ACCECPTED state can once again receive APP_ACCEPTED event, because on
// recovery the app returns ACCEPTED state and the app once again go
// through the scheduler and triggers one more APP_ACCEPTED event at
// ACCEPTED state.
.addTransition(RMAppState.ACCEPTED, RMAppState.ACCEPTED,
    RMAppEventType.APP_ACCEPTED)
{code}
This message is not exactly true: even if the RM did not reboot, an AM can call allocate without having registered.
{code}
String message = "Application Master is not registered for known application: "
    + appAttemptId.getApplicationId()
    + ". Looks like RM rebooted. Let AM resync.";
{code}
The following check was removed from finishApplicationMaster so that it no longer throws an exception if the app did not register before. If an app does not register and calls unregister directly, the call will go through successfully. Should we return the resync command here as well, so that the client is expected to re-register and then unregister?
{code}
if (!hasApplicationMasterRegistered(applicationAttemptId)) {
  String message =
      "Application Master is trying to unregister before registering for: "
      + applicationAttemptId.getApplicationId();
  LOG.error(message);
  RMAuditLogger.logFailure(
      this.rmContext.getRMApps()
          .get(applicationAttemptId.getApplicationId()).getUser(),
      AuditConstants.UNREGISTER_AM, "", "ApplicationMasterService",
      message, applicationAttemptId.getApplicationId(),
      applicationAttemptId);
  throw new InvalidApplicationMasterRequestException(message);
}
{code}
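The alternative raised in this comment, answering an unregister-before-register with a resync instead of either throwing or silently succeeding, might look roughly like the sketch below. `UnregisterCheck`, its `Outcome` enum, and the string attempt IDs are hypothetical simplifications introduced for the illustration; the real service works with ApplicationAttemptId objects and protocol records.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the unregister check; names are illustrative only.
class UnregisterCheck {
    enum Outcome { FINISHED, RESYNC }

    private final Set<String> knownAttempts = new HashSet<>();
    private final Set<String> registeredAttempts = new HashSet<>();

    /** An attempt the RM knows about, e.g. one recovered after a restart. */
    void addAttempt(String attemptId) { knownAttempts.add(attemptId); }

    void register(String attemptId) {
        requireKnown(attemptId);
        registeredAttempts.add(attemptId);
    }

    /** Known but unregistered attempts are told to resync rather than rejected. */
    Outcome unregister(String attemptId) {
        requireKnown(attemptId);
        if (!registeredAttempts.contains(attemptId)) {
            return Outcome.RESYNC;
        }
        registeredAttempts.remove(attemptId);
        return Outcome.FINISHED;
    }

    private void requireKnown(String attemptId) {
        if (!knownAttempts.contains(attemptId)) {
            throw new IllegalArgumentException("Unknown attempt: " + attemptId);
        }
    }
}
```

The sketch keeps the existing contract (no unregister without register) while still giving a restarted RM a way to recover: the AM registers again and repeats the call.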
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027118#comment-14027118 ] Anubhav Dhoot commented on YARN-1365: -
bq. If an app did not register and do unregister directly, it'll go through successfully. should we return resync command here also and client is expected to do re-register and then unregister?
As per the discussion in YARN-1366, it seemed that we wanted to allow unregister without register, and this change was part of demonstrating that. The problem I can see is that we would be missing information such as the AM host and port, which is provided only during register. I am not sure whether we need it once the attempt is finishing. I can revert this to resync if we think we should not break this contract, or for any other reason. Resync does make this simpler, and I would vote for that.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027187#comment-14027187 ] Bikas Saha commented on YARN-1365: --
Sounds like the right approach. It keeps things consistent. Allowing unregister without register (while sounding harmless by itself) would need changes in the state machine to support it, and it would also break the existing contract that even an empty application needs to at least call register and unregister, or it's considered failed.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027189#comment-14027189 ] Jian He commented on YARN-1365: ---
Doing resync for unregister as well keeps the semantics that an application should not be allowed to unregister if it has not registered before. [~bikassaha], [~vinodkv], [~rohithsharma], [~ozawa], do we agree on doing this?
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019832#comment-14019832 ] Anubhav Dhoot commented on YARN-1365: -
I will fix it here. I will be posting a fix shortly.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14019574#comment-14019574 ] Jian He commented on YARN-1365: ---
bq. The option I see is we pass in a flag to AppAttemptAddedSchedulerEvent that tells scheduler not to issue ATTEMPT_ADDED.
Makes sense. Anubhav, do you want to comment on YARN-1368 so that I can fix it there, or do you want to include the fix here?
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14013207#comment-14013207 ] Vinod Kumar Vavilapalli commented on YARN-1365: ---
bq. Can we please consolidate YARN-1366 and YARN-1365 into one JIRA?
I was asked to elaborate offline. The titles are
- YARN-1365: ApplicationMasterService to allow Register and Unregister of an app that was running before restart
- YARN-1366: ApplicationMasterService should Resync with the AM upon allocate call after restart
I haven't looked at the set of patches, but it seems that either they are two different solutions to the same problem or, if not, they will likely conflict a lot in terms of code changes.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007024#comment-14007024 ] Rohith commented on YARN-1365: --
Hi Anubhav, one comment on the patch.
* Notifying the scheduler of APP_ATTEMPT_ADDED from RMApp leads to an InvalidStateTransition exception for RMAppAttempt. Can this be handled in RMAppAttemptImpl#AttemptRecoveredTransition? During recovery of an RMApp, all attempts are recovered synchronously, so the RMAppAttempt state has already moved to LAUNCHED before the scheduler is notified.
{noformat}
// Let scheduler know about this attempt so it can allow AM to register
boolean disableTransferState = false;
app.handler.handle(new AppAttemptAddedSchedulerEvent(app.currentAttempt
    .getAppAttemptId(), disableTransferState));
{noformat}
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007358#comment-14007358 ] Tsuyoshi OZAWA commented on YARN-1365: --
Hi [~rohithsharma], can you clarify in which case the InvalidStateTransition exception is raised? IIUC, the recovery path is as follows:
1. RMAppManager#recoverApplication() is invoked.
2. RMAppEvent(appId, RMAppEventType.RECOVER) is handled and RMAppRecoveredTransition() is invoked.
3. AppAttemptAddedSchedulerEvent() is handled and APP_ATTEMPT_ADDED is processed.
I thought this path works well and that the test case included in the patch covers it. Please correct me if I'm wrong. Thanks.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007768#comment-14007768 ] Anubhav Dhoot commented on YARN-1365: -
The error is that RMAppRecoveredTransition leaves the attempt in LAUNCHED and then the scheduler issues ATTEMPT_ADDED. I see that Jian fixed it in a certain way in YARN-1368, but that only addresses the case where the attempt is in LAUNCHED. If the state reaches RUNNING before that, we still get the error. The option I see is to pass a flag to AppAttemptAddedSchedulerEvent that tells the scheduler not to issue ATTEMPT_ADDED. The flag would be set in RMAppRecoveredTransition. Let me know what you think.
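The flag proposed in this comment can be sketched with a simplified event and scheduler. `AttemptAddedEvent`, `SketchScheduler`, and the field name `suppressAttemptAdded` are assumptions made up for the illustration; the real AppAttemptAddedSchedulerEvent and scheduler code carry considerably more state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified event and scheduler; not the real YARN classes.
class AttemptAddedEvent {
    final String attemptId;
    final boolean suppressAttemptAdded; // assumed flag name, set on the recovery path

    AttemptAddedEvent(String attemptId, boolean suppressAttemptAdded) {
        this.attemptId = attemptId;
        this.suppressAttemptAdded = suppressAttemptAdded;
    }
}

class SketchScheduler {
    final List<String> tracked = new ArrayList<>();
    final List<String> notifications = new ArrayList<>();

    void handle(AttemptAddedEvent event) {
        tracked.add(event.attemptId); // the scheduler tracks the attempt either way
        if (event.suppressAttemptAdded) {
            // A recovered attempt may already be in LAUNCHED or RUNNING, where
            // ATTEMPT_ADDED would be an invalid transition, so it is not sent.
            return;
        }
        notifications.add("ATTEMPT_ADDED:" + event.attemptId);
    }
}
```

Putting the decision in the event, rather than having the scheduler inspect the attempt state, avoids the race described in the comment: it does not matter whether the attempt has reached LAUNCHED or RUNNING by the time the event is handled.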
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007889#comment-14007889 ] Vinod Kumar Vavilapalli commented on YARN-1365: ---
Can we please consolidate YARN-1366 and YARN-1365 into one JIRA?
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004900#comment-14004900 ] Anubhav Dhoot commented on YARN-1365: -
The failed test has race conditions that I am fixing.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14003782#comment-14003782 ] Hadoop QA commented on YARN-1365: -
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12645826/YARN-1365.002.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3775//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3775//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999139#comment-13999139 ] Hadoop QA commented on YARN-1365: -
{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12645051/YARN-1365.001.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3746//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3746//console

This message is automatically generated.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998625#comment-13998625 ] Tsuyoshi OZAWA commented on YARN-1365: --
Oops, this comment was meant for YARN-1367. I'll post it on YARN-1367.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999719#comment-13999719 ] Tsuyoshi OZAWA commented on YARN-1365: --
Sure! I'll check it.
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998592#comment-13998592 ] Tsuyoshi OZAWA commented on YARN-1365: -- I've read your code. The prototype includes the following changes: 1. Changed NodeManager's RegisterNodeManagerRequest to send ContainerReport. 2. Added a configuration option, RM_WORK_PRESERVING_RECOVERY_ENABLED. 3. Added the cluster timestamp to the ContainerId. I think we should focus on having the NM resync with the RM when RM_WORK_PRESERVING_RECOVERY_ENABLED is set to true. Can you add the resync code (the ResourceManager-side code) to the patch? Also, regarding the ContainerId format, let's discuss it on YARN-2052. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.initial.patch
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998979#comment-13998979 ] Anubhav Dhoot commented on YARN-1365: - Hi [~ozawa], I just saw your comment after I had it ready. Could you please help review the tests I added? Thanks. ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.001.patch, YARN-1365.initial.patch
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998633#comment-13998633 ] Tsuyoshi OZAWA commented on YARN-1365: -- [~adhoot], it looks good to me. Do you mind if I work with you on this by writing the test code? ApplicationMasterService to allow Register and Unregister of an app that was running before restart --- Key: YARN-1365 URL: https://issues.apache.org/jira/browse/YARN-1365 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1365.initial.patch