[jira] [Commented] (YARN-1370) Fair scheduler to re-populate container allocation state
[ https://issues.apache.org/jira/browse/YARN-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092029#comment-14092029 ] Hadoop QA commented on YARN-1370: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660838/YARN-1370.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4578//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4578//console This message is automatically generated. 
Fair scheduler to re-populate container allocation state Key: YARN-1370 URL: https://issues.apache.org/jira/browse/YARN-1370 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1370.001.patch YARN-1367 and YARN-1368 enable the NM to tell the RM about currently running containers and the RM will pass this information to the schedulers along with the node information. The schedulers are currently already informed about previously running apps when the app data is recovered from the store. The scheduler is expected to be able to repopulate its allocation state from the above 2 sources of information. -- This message was sent by Atlassian JIRA (v6.2#6252)
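The recovery described in YARN-1370 combines two inputs: applications recovered from the RM state store, and the running containers each NM reports on re-registration. The Java sketch below illustrates that merge on a toy model. Every class and method name in it (`RunningContainer`, `RecoveringSchedulerSketch`, `recoverContainers`) is hypothetical and not part of the FairScheduler API, and how to handle containers whose application was not recovered is an assumption left to the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical report of a container that a NodeManager says is still running. */
final class RunningContainer {
    final String containerId;
    final String appId;
    final int memoryMb;

    RunningContainer(String containerId, String appId, int memoryMb) {
        this.containerId = containerId;
        this.appId = appId;
        this.memoryMb = memoryMb;
    }
}

/** Toy scheduler state rebuilt from recovered apps plus NM container reports. */
final class RecoveringSchedulerSketch {
    // Source 1: apps recovered from the RM state store, with memory allocated to each.
    private final Map<String, Integer> allocatedMbByApp = new HashMap<>();
    // Which containers run on which node, so per-node usage can be rebuilt too.
    private final Map<String, List<String>> containersByNode = new HashMap<>();

    void addRecoveredApp(String appId) {
        allocatedMbByApp.putIfAbsent(appId, 0);
    }

    /**
     * Source 2: containers a node reports when it re-registers.
     * Returns the ids of containers whose application was not recovered.
     */
    List<String> recoverContainers(String nodeId, List<RunningContainer> reports) {
        List<String> orphans = new ArrayList<>();
        for (RunningContainer c : reports) {
            Integer current = allocatedMbByApp.get(c.appId);
            if (current == null) {
                orphans.add(c.containerId); // policy for these is left to the caller
                continue;
            }
            allocatedMbByApp.put(c.appId, current + c.memoryMb);
            containersByNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(c.containerId);
        }
        return orphans;
    }

    int allocatedMb(String appId) {
        return allocatedMbByApp.getOrDefault(appId, 0);
    }

    List<String> containersOn(String nodeId) {
        return containersByNode.getOrDefault(nodeId, Collections.emptyList());
    }
}
```

The order matters in this model: apps must be recovered before node reports arrive, otherwise every reported container would be treated as an orphan.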
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092047#comment-14092047 ] Hudson commented on YARN-2302: -- FAILURE: Integrated in Hadoop-trunk-Commit #6044 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6044/]) YARN-2302. Refactor TimelineWebServices. (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617055) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AHSWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/TimelineWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java Refactor TimelineWebServices Key: YARN-2302 URL: https://issues.apache.org/jira/browse/YARN-2302 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen 
Fix For: 2.6.0 Attachments: YARN-2302.1.patch, YARN-2302.2.patch, YARN-2302.3.patch, YARN-2302.4.patch Currently, TimelineWebServices contains non-trivial logic to process the HTTP requests, manipulate the data, check access, and interact with the timeline store. I propose moving the data-oriented logic to a middle layer (a so-called TimelineDataManager), so that TimelineWebServices only processes the requests and calls TimelineDataManager to complete the remaining tasks. This lets the generic history module reuse TimelineDataManager internally (YARN-2033), invoking its put/get methods directly. Otherwise, we would have to send HTTP requests to TimelineWebServices to query the generic history data, which is inefficient.
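The proposed layering can be sketched as three pieces: a store, a middle layer that owns data handling and access checks, and a thin web layer that only parses requests and maps results to HTTP status codes. The names below (`EntityStore`, `DataManagerSketch`, `WebServiceSketch`), the owner-only access rule, and the request path are hypothetical stand-ins, not the actual TimelineDataManager or TimelineWebServices code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage abstraction standing in for the timeline store. */
interface EntityStore {
    void put(String entityId, String owner, String payload);
    String getPayload(String entityId);
    String getOwner(String entityId);
}

final class InMemoryEntityStore implements EntityStore {
    private final Map<String, String> owners = new HashMap<>();
    private final Map<String, String> payloads = new HashMap<>();

    public void put(String entityId, String owner, String payload) {
        owners.put(entityId, owner);
        payloads.put(entityId, payload);
    }

    public String getPayload(String entityId) { return payloads.get(entityId); }

    public String getOwner(String entityId) { return owners.get(entityId); }
}

/** Middle layer: access checks and data handling, callable in-process without HTTP. */
final class DataManagerSketch {
    private final EntityStore store;

    DataManagerSketch(EntityStore store) { this.store = store; }

    void putEntity(String user, String entityId, String payload) {
        store.put(entityId, user, payload);
    }

    /** Returns null if the entity does not exist or the caller is not its owner. */
    String getEntity(String user, String entityId) {
        String owner = store.getOwner(entityId);
        if (owner == null || !owner.equals(user)) {
            return null; // deliberately simplistic ACL: owner-only reads
        }
        return store.getPayload(entityId);
    }
}

/** Thin web layer: parse the request, delegate, translate the result to an HTTP status. */
final class WebServiceSketch {
    private final DataManagerSketch manager;

    WebServiceSketch(DataManagerSketch manager) { this.manager = manager; }

    int handleGet(String remoteUser, String path) {
        String entityId = path.substring(path.lastIndexOf('/') + 1);
        return manager.getEntity(remoteUser, entityId) == null ? 404 : 200;
    }
}
```

With this split, an in-process consumer such as a history module would call `DataManagerSketch.getEntity` directly instead of issuing an HTTP request to the web layer.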
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092068#comment-14092068 ] Hudson commented on YARN-2302: -- FAILURE: Integrated in Hadoop-Yarn-trunk #640 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/640/]) YARN-2302. Refactor TimelineWebServices. (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617055)
[jira] [Commented] (YARN-2400) TestAMRestart fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092067#comment-14092067 ] Hudson commented on YARN-2400: -- FAILURE: Integrated in Hadoop-Yarn-trunk #640 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/640/]) YARN-2400. Fixed TestAMRestart fails intermittently. Contributed by Jian He: (xgong: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java TestAMRestart fails intermittently -- Key: YARN-2400 URL: https://issues.apache.org/jira/browse/YARN-2400 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2240.2.patch, YARN-2400.1.patch
java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:<ALLOCATED> but was:<SCHEDULED>
    at org.junit.Assert.fail(Assert.java:88)
    at org.junit.Assert.failNotEquals(Assert.java:743)
    at org.junit.Assert.assertEquals(Assert.java:118)
    at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
    at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:417)
    at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:579)
    at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:586)
    at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:389)
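The assertion message indicates that a bounded wait timed out while the attempt was still SCHEDULED rather than ALLOCATED. A timeout-bounded polling wait of that general kind looks like the Java sketch below. `StateWaiter` and its parameters are hypothetical; this is not `MockAM.waitForState`, and it says nothing about what the committed fix changed.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class StateWaiter {
    /**
     * Polls {@code current} every {@code pollMs} until it equals {@code expected} or
     * {@code timeoutMs} elapses, then returns the last observed state for the caller to assert on.
     */
    static <S> S waitForState(Supplier<S> current, S expected, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        S observed = current.get();
        while (!expected.equals(observed) && System.nanoTime() - deadline < 0) {
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                return observed;
            }
            observed = current.get();
        }
        return observed;
    }
}
```

A wait like this fails intermittently when the state transition it depends on (here, the scheduler allocating the AM container) sometimes takes longer than the timeout, for example when nothing in the test drives a node heartbeat in time.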
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092069#comment-14092069 ] Hudson commented on YARN-1954: -- FAILURE: Integrated in Hadoop-Yarn-trunk #640 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/640/]) YARN-1954. Added waitFor to AMRMClient(Async). Contributed by Tsuyoshi Ozawa. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617002) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/AMRMClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/async/AMRMClientAsync.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/async/impl/TestAMRMClientAsync.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.7.patch, YARN-1954.8.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to keep the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). 
IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves.
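The idea is that the main thread blocks on a user-supplied condition instead of hand-writing a dummy loop. The Java sketch below shows that pattern with a standalone, hypothetical `WaitForSketch.waitFor` that takes a `java.util.function.BooleanSupplier`; the method committed to AMRMClient(Async) may differ in signature (for example, in the supplier type it accepts and its default check interval).

```java
import java.util.function.BooleanSupplier;

final class WaitForSketch {
    /**
     * Blocks the calling (e.g. main, non-daemon) thread until {@code check} returns true,
     * re-evaluating it every {@code checkEveryMillis} milliseconds.
     */
    static void waitFor(BooleanSupplier check, long checkEveryMillis) throws InterruptedException {
        if (checkEveryMillis <= 0) {
            throw new IllegalArgumentException("checkEveryMillis must be positive");
        }
        while (!check.getAsBoolean()) {
            Thread.sleep(checkEveryMillis);
        }
    }
}
```

In an AM, the callback that triggers unregistration would flip a shared flag (for instance an `AtomicBoolean`), and the main thread would call `waitFor(flag::get, 100)` so the process stays alive until that happens.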
[jira] [Commented] (YARN-2400) TestAMRestart fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092088#comment-14092088 ] Hudson commented on YARN-2400: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1833 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1833/]) YARN-2400. Fixed TestAMRestart fails intermittently. Contributed by Jian He: (xgong: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617028)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092090#comment-14092090 ] Hudson commented on YARN-1954: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1833 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1833/]) YARN-1954. Added waitFor to AMRMClient(Async). Contributed by Tsuyoshi Ozawa. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617002)
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092089#comment-14092089 ] Hudson commented on YARN-2302: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1833 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1833/]) YARN-2302. Refactor TimelineWebServices. (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617055)
[jira] [Commented] (YARN-2302) Refactor TimelineWebServices
[ https://issues.apache.org/jira/browse/YARN-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092098#comment-14092098 ] Hudson commented on YARN-2302: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1859 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1859/]) YARN-2302. Refactor TimelineWebServices. (Contributed by Zhijie Shen) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617055)
[jira] [Commented] (YARN-2400) TestAMRestart fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092097#comment-14092097 ] Hudson commented on YARN-2400: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1859 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1859/]) YARN-2400. Fixed TestAMRestart fails intermittently. Contributed by Jian He: (xgong: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617028)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092099#comment-14092099 ] Hudson commented on YARN-1954: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1859 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1859/]) YARN-1954. Added waitFor to AMRMClient(Async). Contributed by Tsuyoshi Ozawa. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617002)
[jira] [Updated] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1915: - Attachment: YARN-1915.patch We're starting to see this as well in our rollout of 2.x. Attaching a patch that works around the issue by having the AM secret manager wait around for a bit before trying to validate a token if the master key isn't set yet. Another approach we could try is to have the RM not advertise to clients where the AM is (i.e.: hide the host, port, and tracking URL) until the RM has seen at least one heartbeat after the AM registered. The approach in this patch was easy to implement and probably just as effective in practice. ClientToAMTokenMasterKey should be provided to AM at launch time Key: YARN-1915 URL: https://issues.apache.org/jira/browse/YARN-1915 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Hitesh Shah Priority: Critical Attachments: YARN-1915.patch Currently, the AM receives the key as part of registration. This introduces a race where a client can connect to the AM when the AM has not received the key. Current Flow:
1) AM needs to start the client listening service in order to get host:port and send it to the RM as part of registration
2) RM gets the port info in register() and transitions the app to RUNNING. Responds back with client secret to AM.
3) User asks RM for client token. Gets it and pings the AM. AM hasn't received client secret from RM and so RPC itself rejects the request.
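The workaround in the patch comment (have the AM secret manager wait for a bit when the master key is not set yet) can be illustrated with a guarded wait. The Java sketch below uses a hypothetical `MasterKeyHolder` with `wait`/`notifyAll`; the patch's actual class, timeout value, and waiting mechanism are not shown here and may differ.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical holder for the client-to-AM master key that tolerates the registration race. */
final class MasterKeyHolder {
    private byte[] masterKey; // guarded by this

    /** Called once the AM receives the key (currently, as part of registration). */
    synchronized void setMasterKey(byte[] key) {
        masterKey = key.clone();
        notifyAll(); // wake any validators that arrived before the key
    }

    /**
     * Returns a copy of the key, waiting up to maxWaitMs for it to be set.
     * Returns null if the key is still missing after the wait.
     */
    synchronized byte[] awaitMasterKey(long maxWaitMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        while (masterKey == null) { // loop guards against spurious wakeups
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return null;
            }
            wait(remainingMs);
        }
        return masterKey.clone();
    }
}
```

A token validator would call `awaitMasterKey` and reject the connection only if it returns null, so a client that pings the AM in step 3 just before registration completes is delayed briefly instead of being refused.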
[jira] [Created] (YARN-2402) NM restart: Container recovery for Windows
Jason Lowe created YARN-2402: Summary: NM restart: Container recovery for Windows Key: YARN-2402 URL: https://issues.apache.org/jira/browse/YARN-2402 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe We should add container recovery for NM restart on Windows.
[jira] [Commented] (YARN-2402) NM restart: Container recovery for Windows
[ https://issues.apache.org/jira/browse/YARN-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092185#comment-14092185 ] Jason Lowe commented on YARN-2402: -- See YARN-1337 for the changes needed to the container executors to handle this on UNIX/Linux.
[jira] [Commented] (YARN-1915) ClientToAMTokenMasterKey should be provided to AM at launch time
[ https://issues.apache.org/jira/browse/YARN-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092196#comment-14092196 ] Hadoop QA commented on YARN-1915: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660870/YARN-1915.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4580//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4580//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4580//console This message is automatically generated. 
ClientToAMTokenMasterKey should be provided to AM at launch time Key: YARN-1915 URL: https://issues.apache.org/jira/browse/YARN-1915 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0 Reporter: Hitesh Shah Assignee: Jason Lowe Priority: Critical Attachments: YARN-1915.patch
[jira] [Updated] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chang li updated YARN-2308: --- Attachment: jira2308.patch NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Attachments: jira2308.patch I encountered an NPE when the RM restarted:
{code}
2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:744)
{code}
As a result, the RM fails to restart. This is caused by a queue configuration change: I removed some queues and added new ones. When the RM restarts, it tries to recover historical applications, and if the queue of any of these applications has been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
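A minimal sketch of the kind of guard this report calls for, assuming the NPE comes from looking up an application that recovery never re-registered because its queue had been removed. `RecoveryGuardSketch` and its methods are hypothetical stand-ins written in plain Java, not the CapacityScheduler API: an application whose queue is missing is rejected, and a later attempt for it is logged and skipped instead of being dereferenced.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

/**
 * Hypothetical, simplified model of a scheduler recovering applications after
 * a restart. Names are illustrative only; this is not CapacityScheduler code.
 */
final class RecoveryGuardSketch {
    private static final Logger LOG = Logger.getLogger(RecoveryGuardSketch.class.getName());

    // queue name -> number of attempts currently tracked in that queue
    private final Map<String, Integer> queues = new HashMap<>();
    // application id -> name of the queue the application was accepted into
    private final Map<String, String> applications = new HashMap<>();

    void addQueue(String queue) {
        queues.put(queue, 0);
    }

    /** Accepts a recovered application only if its queue still exists. */
    boolean addApplication(String appId, String queue) {
        if (!queues.containsKey(queue)) {
            LOG.severe("Queue " + queue + " no longer exists; rejecting " + appId);
            return false;
        }
        applications.put(appId, queue);
        return true;
    }

    /** Null-checks the lookup instead of dereferencing a missing application. */
    boolean addApplicationAttempt(String appId) {
        String queue = applications.get(appId);
        if (queue == null) {
            LOG.severe("Application " + appId + " was not accepted; ignoring its attempt");
            return false;
        }
        queues.merge(queue, 1, Integer::sum);
        return true;
    }

    /** Returns the attempt count for a queue, or -1 if the queue does not exist. */
    int attemptsIn(String queue) {
        return queues.getOrDefault(queue, -1);
    }

    public static void main(String[] args) {
        RecoveryGuardSketch s = new RecoveryGuardSketch();
        s.addQueue("default"); // the queue "old" was removed from the configuration
        System.out.println(s.addApplication("app_1", "old"));     // false
        System.out.println(s.addApplicationAttempt("app_1"));     // false, no NPE
        System.out.println(s.addApplication("app_2", "default")); // true
        System.out.println(s.addApplicationAttempt("app_2"));     // true
        System.out.println(s.attemptsIn("default"));              // 1
    }
}
```

Rejecting the application up front and then skipping its attempt keeps a single removed queue from aborting recovery of every other application.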
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092245#comment-14092245 ] Hadoop QA commented on YARN-2308: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660878/jira2308.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4581//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4581//console
This message is automatically generated.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092246#comment-14092246 ] Karthik Kambatla commented on YARN-2315: Thanks for catching this, [~zxu]. Mind adding a test case or augmenting existing tests to demonstrate the problem? Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. --- Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2315.patch We should use setCurrentCapacity instead of setCapacity to set the used resource capacity for FairScheduler. In getQueueInfo of FSQueue.java, setCapacity is called twice with different arguments, so the first value is overwritten by the second:
{code}
queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory());
queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory());
{code}
We should change the second setCapacity call to setCurrentCapacity so that it records the currently used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
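To illustrate why the second `setCapacity` call hides the fair-share value, here is a sketch using a hypothetical `QueueInfoSketch` class with separate capacity and current-capacity fields (it stands in for, and is not, the real queue-info record). The numbers assume a 4096 MB fair share, 1024 MB in use, and an 8192 MB cluster.

```java
/** Hypothetical stand-in for a queue-info record with two distinct fields. */
final class QueueInfoSketch {
    private float capacity;        // fair-share fraction of the cluster
    private float currentCapacity; // fraction of the cluster currently in use

    void setCapacity(float c) { this.capacity = c; }
    void setCurrentCapacity(float c) { this.currentCapacity = c; }
    float getCapacity() { return capacity; }
    float getCurrentCapacity() { return currentCapacity; }

    /** Mirrors the reported pattern: the second call overwrites the first. */
    static QueueInfoSketch buggy(long fairShareMb, long usedMb, long clusterMb) {
        QueueInfoSketch q = new QueueInfoSketch();
        q.setCapacity((float) fairShareMb / clusterMb);
        q.setCapacity((float) usedMb / clusterMb); // fair share is lost here
        return q;
    }

    /** Mirrors the proposed fix: usage goes into its own field. */
    static QueueInfoSketch fixed(long fairShareMb, long usedMb, long clusterMb) {
        QueueInfoSketch q = new QueueInfoSketch();
        q.setCapacity((float) fairShareMb / clusterMb);
        q.setCurrentCapacity((float) usedMb / clusterMb);
        return q;
    }

    public static void main(String[] args) {
        QueueInfoSketch b = buggy(4096, 1024, 8192);
        QueueInfoSketch f = fixed(4096, 1024, 8192);
        System.out.println(b.getCapacity() + " " + b.getCurrentCapacity()); // 0.125 0.0
        System.out.println(f.getCapacity() + " " + f.getCurrentCapacity()); // 0.5 0.125
    }
}
```

A test along these lines (asserting that capacity and current capacity differ when usage differs from the fair share) is one way to demonstrate the problem, as requested in the comment.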
[jira] [Updated] (YARN-2337) ResourceManager sets ClientRMService in RMContext multiple times
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2337: --- Priority: Trivial (was: Minor) Summary: ResourceManager sets ClientRMService in RMContext multiple times (was: remove duplication function call (setClientRMService) in resource manage class) ResourceManager sets ClientRMService in RMContext multiple times Key: YARN-2337 URL: https://issues.apache.org/jira/browse/YARN-2337 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: newbie Attachments: YARN-2337.000.patch Remove the duplicate function call (setClientRMService) in the ResourceManager class: rmContext.setClientRMService(clientRM); is called twice in serviceInit of ResourceManager. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2337) ResourceManager sets ClientRMService in RMContext multiple times
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2337: --- Target Version/s: 2.6.0 Affects Version/s: 2.5.0 Labels: newbie (was: ) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2337) ResourceManager sets ClientRMService in RMContext multiple times
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092249#comment-14092249 ] Karthik Kambatla commented on YARN-2337: +1. Committing this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1337) Recover containers upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1337: - Attachment: YARN-1337-v2.patch Thanks for the comments, Junping!
bq. Maybe LOG.warn is a better option here?
Changed to a warning.
bq. What about msecLeft = 0? The logic quits the while loop but does not throw an exception; it would be better to check msecLeft <= 0.
Good catch! I changed it to msecLeft <= 0.
bq. Should we open a JIRA for this?
Filed YARN-2402 to track adding container recovery support for Windows.
bq. Again, what would happen if removing the container fails (and other actions, e.g. store, etc.)?
If storeContainer fails, then the corresponding container start request will also fail. If storeContainerLaunched fails, then the container launch process will fail and the container will be marked as failed. If storeContainerKilled fails, then the corresponding container kill request will also fail. If storeContainerDiagnostics fails, then we can lose prior diagnostic strings for a container upon restart, but the container will continue. This seems like a reasonable tradeoff, but we could change it so that a store failure also kills the container if that is deemed more desirable. If removeContainer fails, then the container will remain in the state store but be removed from the internal state. That means we'll reload the completed container state upon restart, but this should be safe because we'll only track it as a completed container that will eventually be removed from memory by the NodeStatusUpdaterImpl the next time it scans for old containers.
bq. We set the NM port to 0 to signal delayedRpcServerStart. Does this sound a little tricky? Maybe replace it with a new configuration?
No new config is necessary. Essentially, the issue is that we need to delay starting the RPC server if we're recovering containers, because client requests for containers being recovered can disrupt the recovery process.
I updated the code to try to make this clearer.
bq. Unnecessary change?
Removed.
bq. It could cause trouble here if we allow the NM's resource to change (once YARN-291 is done) during NM restart. Maybe we should just remove the container-killing code rather than move it elsewhere?
Good point. I removed the container-killing code and had the node update itself with any new resource total and HTTP address in case those were updated as part of the NM restart. I also had to fix a bug where the CapacityScheduler wasn't updating queue metrics when a node's resources changed during a status update.
Recover containers upon nodemanager restart --- Key: YARN-1337 URL: https://issues.apache.org/jira/browse/YARN-1337 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1337-v1.patch, YARN-1337-v2.patch To support work-preserving NM restart, we need to recover the state of the containers as of when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers, along with how to reacquire the active containers and determine their exit codes when they terminate. The state of finished containers also needs to be recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
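The `msecLeft` review point concerns a deadline-bounded wait. Below is a sketch with a hypothetical `DeadlineWait.waitUntil` helper (not the NodeManager's actual code), assuming a loop that polls while time remains and then decides whether it timed out. It shows why the post-loop check should treat exactly 0 ms remaining as a timeout.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: poll a condition until it holds or a deadline passes. */
final class DeadlineWait {
    static void waitUntil(BooleanSupplier done, long timeoutMs, long pollMs)
            throws InterruptedException, TimeoutException {
        if (pollMs <= 0) {
            throw new IllegalArgumentException("pollMs must be positive");
        }
        final long deadline = System.currentTimeMillis() + timeoutMs;
        long msecLeft = timeoutMs;
        while (!done.getAsBoolean() && msecLeft > 0) {
            Thread.sleep(Math.min(pollMs, msecLeft));
            msecLeft = deadline - System.currentTimeMillis();
        }
        // The loop exits when the condition holds or msecLeft has dropped to 0
        // or below. Testing "msecLeft < 0" here would let the exactly-0 case
        // return silently even though the condition was never met. The
        // condition is re-checked so a success on the last poll is not
        // reported as a timeout.
        if (msecLeft <= 0 && !done.getAsBoolean()) {
            throw new TimeoutException("condition not met within " + timeoutMs + " ms");
        }
    }

    public static void main(String[] args) throws Exception {
        waitUntil(() -> true, 100, 10); // returns immediately
        try {
            waitUntil(() -> false, 0, 10); // no time budget at all
        } catch (TimeoutException e) {
            System.out.println("timed out as expected");
        }
    }
}
```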
[jira] [Created] (YARN-2403) TestNodeManagerResync fails occasionally in trunk
Ted Yu created YARN-2403: Summary: TestNodeManagerResync fails occasionally in trunk Key: YARN-2403 URL: https://issues.apache.org/jira/browse/YARN-2403 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk/640/ :
{code}
TestNodeManagerResync.testKillContainersOnResync:112->testContainerPreservationOnResyncImpl:146 expected:<2> but was:<1>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2337) ResourceManager sets ClientRMService in RMContext multiple times
[ https://issues.apache.org/jira/browse/YARN-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092295#comment-14092295 ] Hudson commented on YARN-2337: -- FAILURE: Integrated in Hadoop-trunk-Commit #6046 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6046/]) YARN-2337. ResourceManager sets ClientRMService in RMContext multiple times. (Zhihai Xu via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617183)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
ResourceManager sets ClientRMService in RMContext multiple times Key: YARN-2337 URL: https://issues.apache.org/jira/browse/YARN-2337 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: newbie Fix For: 2.6.0 Attachments: YARN-2337.000.patch Remove the duplicate function call (setClientRMService) in the ResourceManager class: rmContext.setClientRMService(clientRM); is called twice in serviceInit of ResourceManager. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092301#comment-14092301 ] Wangda Tan commented on YARN-2308: -- [~lichangleo], thanks for working on this. I took a quick look at your patch, and I think the general approach is fine. Some minor suggestions:
1)
{code}
+if (application==null) {
+  LOG.info("can't retireve application attempt");
+  return;
+}
{code}
Please leave a space before and after ==, and use LOG.error instead of LOG.info.
2) Test code
2.1
bq. +System.out.println("testing queue change!!!");
Please remove this.
2.2
{code}
+conf.setBoolean(CapacitySchedulerConfiguration.ENABLE_USER_METRICS, true);
+conf.set(CapacitySchedulerConfiguration.RESOURCE_CALCULATOR_CLASS,
{code}
We may not need these either.
2.3
{code}
+// clear queue metrics
+rm1.clearQueueMetrics(app1);
{code}
This may not be needed either.
2.4 It's better to wait for and check the app's state transition to Failed after it is rejected.
2.5 I think this test isn't specific to work-preserving restart, so it's better to place it in TestRMRestart.
Please let me know if you have any comments on them.
Thanks, Wangda
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092303#comment-14092303 ] Wangda Tan commented on YARN-415: - [~eepayne],
bq. I created a common method that both of these call.
Thanks!
bq. I also noticed that testUsageWithMultipleContainers was doing similar things to testUsageAfterRMRestart, so I combined them both into testUsageWithMultipleContainersAndRMRestart.
Good catch. I don't have further comments, but could you please check the test failure above?
Thanks, Wangda
Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, because even if the app didn't use all that memory, no one else was able to use it:
(reserved RAM for container 1 * lifetime of container 1) + (reserved RAM for container 2 * lifetime of container 2) + ... + (reserved RAM for container n * lifetime of container n)
It'd be nice to have this at the app level instead of the job level because:
1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server).
2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
This new metric should be available both through the RM UI and the RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
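The chargeback formula in this description can be sketched in Java under a few assumptions: each container is a hypothetical `ContainerUsage` record holding reserved MB and start/finish times in milliseconds, and each container's product is truncated to whole MB-seconds. This illustrates the arithmetic only; it is not the metric implementation that YARN adopted.

```java
import java.util.Arrays;
import java.util.List;

final class MemorySeconds {
    /** Hypothetical per-container record; not a YARN type. */
    static final class ContainerUsage {
        final long reservedMb;
        final long startMillis;
        final long finishMillis;

        ContainerUsage(long reservedMb, long startMillis, long finishMillis) {
            this.reservedMb = reservedMb;
            this.startMillis = startMillis;
            this.finishMillis = finishMillis;
        }
    }

    /** Sum over containers of (reserved MB * lifetime in seconds). */
    static long mbSeconds(List<ContainerUsage> containers) {
        long total = 0;
        for (ContainerUsage c : containers) {
            long lifetimeMillis = c.finishMillis - c.startMillis;
            // Multiply before dividing so sub-second lifetimes still count.
            total += c.reservedMb * lifetimeMillis / 1000;
        }
        return total;
    }

    public static void main(String[] args) {
        List<ContainerUsage> app = Arrays.asList(
                new ContainerUsage(2048, 0, 60_000),       // 2048 MB for 60 s
                new ContainerUsage(1024, 10_000, 40_000)); // 1024 MB for 30 s
        System.out.println(mbSeconds(app)); // 153600
    }
}
```

Here 2048 MB for 60 s contributes 122880 MB-seconds and 1024 MB for 30 s contributes 30720, for a total of 153600.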
[jira] [Commented] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092321#comment-14092321 ] Karthik Kambatla commented on YARN-2361: +1. Checking this in. remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine -- Key: YARN-2361 URL: https://issues.apache.org/jira/browse/YARN-2361 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Attachments: YARN-2361.000.patch Remove duplicate entries in the EnumSet of event types in the RMAppAttempt state machine. The event RMAppAttemptEventType.EXPIRE appears twice in the following code:
{code}
EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
    RMAppAttemptEventType.EXPIRE,
    RMAppAttemptEventType.LAUNCHED,
    RMAppAttemptEventType.LAUNCH_FAILED,
    RMAppAttemptEventType.EXPIRE,
    RMAppAttemptEventType.REGISTERED,
    RMAppAttemptEventType.CONTAINER_ALLOCATED,
    RMAppAttemptEventType.UNREGISTERED,
    RMAppAttemptEventType.KILL,
    RMAppAttemptEventType.STATUS_UPDATE))
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
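The duplicate `EXPIRE` is redundant rather than harmful at runtime: `EnumSet.of` builds a set, so a repeated constant collapses into a single element. The sketch below uses a local `EventType` enum that only reuses the names from the quoted snippet; it is not `RMAppAttemptEventType`.

```java
import java.util.EnumSet;

final class EnumSetDuplicates {
    // Illustrative local enum reusing the event names; not the real RMAppAttemptEventType.
    enum EventType {
        ATTEMPT_ADDED, EXPIRE, LAUNCHED, LAUNCH_FAILED, REGISTERED,
        CONTAINER_ALLOCATED, UNREGISTERED, KILL, STATUS_UPDATE
    }

    static EnumSet<EventType> withDuplicate() {
        return EnumSet.of(EventType.ATTEMPT_ADDED, EventType.EXPIRE,
                EventType.LAUNCHED, EventType.LAUNCH_FAILED,
                EventType.EXPIRE, // the repeated entry
                EventType.REGISTERED, EventType.CONTAINER_ALLOCATED,
                EventType.UNREGISTERED, EventType.KILL, EventType.STATUS_UPDATE);
    }

    static EnumSet<EventType> withoutDuplicate() {
        return EnumSet.of(EventType.ATTEMPT_ADDED, EventType.EXPIRE,
                EventType.LAUNCHED, EventType.LAUNCH_FAILED,
                EventType.REGISTERED, EventType.CONTAINER_ALLOCATED,
                EventType.UNREGISTERED, EventType.KILL, EventType.STATUS_UPDATE);
    }

    public static void main(String[] args) {
        System.out.println(withDuplicate().size());                     // 9
        System.out.println(withDuplicate().equals(withoutDuplicate())); // true
    }
}
```

Removing the repeat therefore changes readability, not behavior.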
[jira] [Updated] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2361: --- Priority: Trivial (was: Minor) Target Version/s: 2.6.0 Affects Version/s: 2.5.0 Assignee: zhihai xu -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2361) RMAppAttempt state machine entries for KILLED state has duplicate event entries
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2361: --- Summary: RMAppAttempt state machine entries for KILLED state has duplicate event entries (was: remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine) RMAppAttempt state machine entries for KILLED state has duplicate event entries --- Key: YARN-2361 URL: https://issues.apache.org/jira/browse/YARN-2361 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Attachments: YARN-2361.000.patch Remove duplicate entries in the EnumSet of event types in the RMAppAttempt state machine. The event RMAppAttemptEventType.EXPIRE appears twice in the following code:
{code}
EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
    RMAppAttemptEventType.EXPIRE,
    RMAppAttemptEventType.LAUNCHED,
    RMAppAttemptEventType.LAUNCH_FAILED,
    RMAppAttemptEventType.EXPIRE,
    RMAppAttemptEventType.REGISTERED,
    RMAppAttemptEventType.CONTAINER_ALLOCATED,
    RMAppAttemptEventType.UNREGISTERED,
    RMAppAttemptEventType.KILL,
    RMAppAttemptEventType.STATUS_UPDATE))
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2361) RMAppAttempt state machine entries for KILLED state has duplicate event entries
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092332#comment-14092332 ] Hudson commented on YARN-2361: -- FAILURE: Integrated in Hadoop-trunk-Commit #6047 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6047/]) YARN-2361. RMAppAttempt state machine entries for KILLED state has duplicate event entries. (Zhihai Xu via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1617190)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
RMAppAttempt state machine entries for KILLED state has duplicate event entries --- Key: YARN-2361 URL: https://issues.apache.org/jira/browse/YARN-2361 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Fix For: 2.6.0 Attachments: YARN-2361.000.patch Remove duplicate entries in the EnumSet of event types in the RMAppAttempt state machine. The event RMAppAttemptEventType.EXPIRE appears twice in the following code:
{code}
EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
    RMAppAttemptEventType.EXPIRE,
    RMAppAttemptEventType.LAUNCHED,
    RMAppAttemptEventType.LAUNCH_FAILED,
    RMAppAttemptEventType.EXPIRE,
    RMAppAttemptEventType.REGISTERED,
    RMAppAttemptEventType.CONTAINER_ALLOCATED,
    RMAppAttemptEventType.UNREGISTERED,
    RMAppAttemptEventType.KILL,
    RMAppAttemptEventType.STATUS_UPDATE))
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2277) Add Cross-Origin support to the ATS REST API
[ https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2277: -- Attachment: YARN-2277-v5.patch Addressing the findbugs warning. Add Cross-Origin support to the ATS REST API Key: YARN-2277 URL: https://issues.apache.org/jira/browse/YARN-2277 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2277-CORS.patch, YARN-2277-JSONP.patch, YARN-2277-v2.patch, YARN-2277-v3.patch, YARN-2277-v3.patch, YARN-2277-v4.patch, YARN-2277-v5.patch As the Application Timeline Server does not come with a built-in UI, it may make sense to enable JSONP or CORS capabilities in its REST API so that a remote UI can access the data directly via JavaScript without the browser's cross-site restrictions coming into play. An example client might look like http://api.jquery.com/jQuery.getJSON/. This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
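As a rough illustration of what CORS support involves on the server side, here is a sketch of the decision a REST endpoint makes: given the request's `Origin` header and an allow-list, which response headers to return. `CorsHeaders` and its allow-list are hypothetical; this is not the filter or configuration added by the attached patches. The header names themselves (`Access-Control-Allow-Origin`, `Access-Control-Allow-Methods`, `Vary`) are standard HTTP/CORS headers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a CORS origin check; not the ATS implementation. */
final class CorsHeaders {
    private final Set<String> allowedOrigins;

    CorsHeaders(Set<String> allowedOrigins) {
        this.allowedOrigins = allowedOrigins;
    }

    Map<String, String> responseHeaders(String origin) {
        boolean allowed = origin != null
                && (allowedOrigins.contains("*") || allowedOrigins.contains(origin));
        if (!allowed) {
            // No CORS headers: the browser will not expose the response to the page's script.
            return Collections.emptyMap();
        }
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Access-Control-Allow-Origin", origin); // echo the specific origin
        headers.put("Access-Control-Allow-Methods", "GET, HEAD, OPTIONS");
        headers.put("Vary", "Origin"); // response depends on Origin, so caches must key on it
        return headers;
    }

    public static void main(String[] args) {
        CorsHeaders cors = new CorsHeaders(new HashSet<>(Arrays.asList("http://ui.example.com")));
        System.out.println(cors.responseHeaders("http://ui.example.com"));
        System.out.println(cors.responseHeaders("http://other.example.org")); // {}
    }
}
```

Echoing the matched origin together with `Vary: Origin`, rather than always sending `*`, keeps the allow-list meaningful and avoids shared caches serving one origin's response to another.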
[jira] [Updated] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1729: -- Assignee: Billie Rinaldi (was: Leitao Guo) TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Fix For: 2.4.0 Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch, YARN-1729.7.patch Primary filter and secondary filter values can be arbitrary JSON-compatible objects. The web services should determine whether the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.2#6252)
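A sketch of the kind of type detection this description asks for, assuming filter values arrive as raw query-parameter strings and only JSON scalars need handling. `FilterValueParser.parse` is hypothetical and hand-rolled (no escape handling, no arrays or objects); a real implementation would more likely hand the value to a JSON library.

```java
import java.util.regex.Pattern;

final class FilterValueParser {
    // JSON number grammar: optional minus, no leading zeros, optional fraction and exponent.
    private static final Pattern JSON_NUMBER =
            Pattern.compile("-?(0|[1-9][0-9]*)(\\.[0-9]+)?([eE][+-]?[0-9]+)?");

    /** Returns a Boolean, Long, Double, or String depending on how the non-null raw value reads as JSON. */
    static Object parse(String raw) {
        String s = raw.trim();
        if (s.equals("true") || s.equals("false")) {
            return Boolean.valueOf(s);
        }
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1); // quoted JSON string; escapes not handled
        }
        if (JSON_NUMBER.matcher(s).matches()) {
            try {
                return Long.valueOf(s);   // integral and fits in a long
            } catch (NumberFormatException e) {
                return Double.valueOf(s); // fraction, exponent, or too large for a long
            }
        }
        return raw; // anything else stays a plain string
    }

    public static void main(String[] args) {
        for (String v : new String[] {"42", "true", "\"42\"", "1.5", "abc"}) {
            System.out.println(v + " -> " + parse(v).getClass().getSimpleName());
        }
    }
}
```

In this sketch `42` becomes a `Long` while `"42"` stays a `String`, which is the object-versus-string distinction the description wants made before the value reaches the store.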
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092413#comment-14092413 ] Jian He commented on YARN-2138: --- It seems RMAppUpdatedSavedEvent, RMAppNewSavedEvent, etc. are empty files; can you remove them? Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.003.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null, and the same is true for the other notifyDone* methods. We can clean up these methods, as this control-flow path is no longer used. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1954) Add waitFor to AMRMClient(Async)
[ https://issues.apache.org/jira/browse/YARN-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092425#comment-14092425 ] Tsuyoshi OZAWA commented on YARN-1954: -- Thank you for your review and comments, Zhijie! Add waitFor to AMRMClient(Async) Key: YARN-1954 URL: https://issues.apache.org/jira/browse/YARN-1954 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 3.0.0, 2.4.0 Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1954.1.patch, YARN-1954.2.patch, YARN-1954.3.patch, YARN-1954.4.patch, YARN-1954.4.patch, YARN-1954.5.patch, YARN-1954.6.patch, YARN-1954.7.patch, YARN-1954.8.patch Recently, I saw some use cases of AMRMClient(Async). The painful thing is that the main non-daemon thread has to sit in a dummy loop to prevent the AM process from exiting before all the tasks are done, while unregistration is triggered on a separate daemon thread by callback methods (in particular when using AMRMClientAsync). IMHO, it would be beneficial to add a waitFor method to AMRMClient(Async) that blocks the AM until unregistration or a user-supplied checkpoint, so that users don't need to write the loop themselves. -- This message was sent by Atlassian JIRA (v6.2#6252)
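A sketch of the blocking helper this description proposes, assuming the user-supplied check is a `java.util.function.BooleanSupplier` polled at a fixed interval. `WaitForSketch.waitFor` is a hypothetical name and signature, not the API that was committed to `AMRMClient`; the `main` method uses a separate thread to stand in for the callback thread that triggers unregistration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

final class WaitForSketch {
    /** Blocks the calling thread until check returns true, polling every checkEveryMillis. */
    static void waitFor(BooleanSupplier check, long checkEveryMillis) throws InterruptedException {
        if (checkEveryMillis <= 0) {
            throw new IllegalArgumentException("checkEveryMillis must be positive");
        }
        while (!check.getAsBoolean()) {
            Thread.sleep(checkEveryMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean unregistered = new AtomicBoolean(false);
        // Stands in for a callback thread that unregisters the AM once its work is done.
        Thread callback = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            unregistered.set(true);
        });
        callback.setDaemon(true);
        callback.start();
        waitFor(unregistered::get, 10); // replaces the hand-written dummy loop in main
        System.out.println("AM can exit: " + unregistered.get()); // AM can exit: true
    }
}
```

Using an `AtomicBoolean` for the flag makes the update from the callback thread visible to the polling main thread.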
[jira] [Created] (YARN-2404) Cleanup ApplicationAttemptState and ApplicationState in RMStateStore
Jian He created YARN-2404: - Summary: Cleanup ApplicationAttemptState and ApplicationState in RMStateStore Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we can just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2404: -- Description: We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. (was: We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we can just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState.) Summary: Remove ApplicationAttemptState and ApplicationState class in RMStateStore class (was: Cleanup ApplicationAttemptState and ApplicationState in RMStateStore ) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He We can remove ApplicationState and ApplicationAttemptState class in RMStateStore, given that we already have ApplicationStateData and ApplicationAttemptStateData records. we may just replace ApplicationState with ApplicationStateData, similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092443#comment-14092443 ] Varun Saxena commented on YARN-2138: [~jianhe], I have removed these files in the patch. To verify, I applied the patch (YARN-2138.003.patch) to code checked out from trunk and found that the above-mentioned files were deleted, so this patch should work. I used svn delete to remove the files. Let me know if something else needs to be done. Can you verify the patch once more? If you are still facing issues, I will generate a new patch. Cleanup notifyDone* methods in RMStateStore --- Key: YARN-2138 URL: https://issues.apache.org/jira/browse/YARN-2138 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Varun Saxena Attachments: YARN-2138.002.patch, YARN-2138.003.patch, YARN-2138.patch The storedException passed into notifyDoneStoringApplication is always null. The same is true for the other notifyDone* methods. We can clean up these methods, as this control flow path is no longer used.
[jira] [Commented] (YARN-2138) Cleanup notifyDone* methods in RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092458#comment-14092458 ] Jian He commented on YARN-2138: --- Varun, I tried to apply the patch to both the git and svn repositories with patch -p0; the files still remain, they are just empty. Do you mind creating a new patch? The patch seems to conflict with trunk again.
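The notifyDone* cleanup described in YARN-2138 can be illustrated with a small, hypothetical Java sketch. StoreNotifierSketch, notifyDoneStoringApplicationOld, and the String application IDs are invented for the example and do not reflect the real RMStateStore signatures. Because every caller passes null for storedException, the error branch is never taken, so removing the parameter together with that branch should not change behaviour for existing callers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names only loosely mirror RMStateStore and are not the real API.
final class StoreNotifierSketch {
    // Records which applications were reported as stored, for illustration only.
    private final List<String> storedApps = new ArrayList<>();

    // Before: every caller passed null for storedException, so the error branch was dead code.
    void notifyDoneStoringApplicationOld(String appId, Exception storedException) {
        if (storedException != null) {
            // Never reached in practice, because callers always pass null.
            throw new IllegalStateException("store failed for " + appId, storedException);
        }
        storedApps.add(appId);
    }

    // After: the unused parameter and its dead branch are removed.
    void notifyDoneStoringApplication(String appId) {
        storedApps.add(appId);
    }

    List<String> getStoredApps() {
        return storedApps;
    }
}
```

How store failures are actually surfaced in RMStateStore is outside the scope of this sketch.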