[jira] [Commented] (YARN-2075) TestRMAdminCLI consistently fails on trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015227#comment-14015227 ] Tsuyoshi OZAWA commented on YARN-2075: -- I could reproduce this problem on both trunk and branch-2, and the patch works well on both of them locally. [~mitdesai], can you tell us what command you ran? I ran {{mvn clean test -Dtest=TestRMAdminCLI}} with the patch and it works well. TestRMAdminCLI consistently fails on trunk and branch-2 -- Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen Assignee: Kenji Kikushima Attachments: YARN-2075.patch {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
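Editor's note: the {{UnsupportedOperationException}} in the trace above is the classic signature of calling {{remove()}} on a fixed-size list such as the one returned by {{Arrays.asList}}, whose iterator falls back to {{AbstractList.remove()}}. A minimal sketch of that failure mode (illustrative only, not the actual HAAdmin code):
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

public class RemoveFromFixedSizeList {
  public static void main(String[] args) {
    // Arrays.asList is backed by the array: fixed size, no remove() support.
    Collection<String> targets = Arrays.asList("rm1", "rm2");
    try {
      // AbstractCollection.remove -> AbstractList$Itr.remove -> throws,
      // the same call chain as the stack trace above.
      targets.remove("rm1");
    } catch (UnsupportedOperationException e) {
      System.out.println("same failure mode as in the test output");
    }

    // A defensive copy yields a mutable list, so remove() succeeds.
    Collection<String> mutable = new ArrayList<>(Arrays.asList("rm1", "rm2"));
    mutable.remove("rm1");
  }
}
{code}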
[jira] [Commented] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file
[ https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015228#comment-14015228 ] Tsuyoshi OZAWA commented on YARN-1874: -- I found that the test failure is not related to the patch - it's filed as YARN-2075. Resubmitted the patch without updating it. Cleanup: Move RMActiveServices out of ResourceManager into its own file --- Key: YARN-1874 URL: https://issues.apache.org/jira/browse/YARN-1874 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1874.1.patch, YARN-1874.2.patch, YARN-1874.3.patch, YARN-1874.4.patch As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We should move RMActiveServices out to make it more manageable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-804) mark AbstractService init/start/stop methods as final
[ https://issues.apache.org/jira/browse/YARN-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved YARN-804. - Resolution: Won't Fix I don't think we can fix this while mocking is used to test some aspects of the implementation classes... WONTFIX unless there's a workaround mark AbstractService init/start/stop methods as final - Key: YARN-804 URL: https://issues.apache.org/jira/browse/YARN-804 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.1.0-beta Reporter: Steve Loughran Assignee: Vinod Kumar Vavilapalli Attachments: YARN-804-001.patch Now that YARN-117 and MAPREDUCE-5298 are checked in, we can mark the public AbstractService init/start/stop methods as final. Why? It puts the lifecycle check and error handling around the subclass code, ensuring no lifecycle method gets called in the wrong state or gets called more than once. When a {{serviceInit()}}, {{serviceStart()}} or {{serviceStop()}} method throws an exception, it's caught and auto-triggers stop. Marking the methods as final forces service implementations to move to the stricter lifecycle. It has one side effect: some of the mocking tests play up - I'll need some assistance here -- This message was sent by Atlassian JIRA (v6.2#6252)
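Editor's note: the lifecycle described here is the template-method pattern: the final public method owns the state check and failure handling, and delegates to protected hooks. A simplified sketch of that shape (not the actual {{org.apache.hadoop.service.AbstractService}} source):
{code}
public abstract class LifecycleSketch {
  private enum State { NOTINITED, INITED, STARTED, STOPPED }
  private State state = State.INITED; // assume init already ran, for brevity

  // Final wrapper: subclasses cannot bypass the state check or the
  // auto-stop on failure, which is the point of marking these final.
  public final void start() {
    if (state != State.INITED) {
      throw new IllegalStateException("cannot start from state " + state);
    }
    try {
      serviceStart();            // subclass hook
      state = State.STARTED;
    } catch (Exception e) {
      stop();                    // a failed start auto-triggers stop
      throw new RuntimeException("service failed to start", e);
    }
  }

  public final void stop() {
    if (state == State.STOPPED) {
      return;                    // idempotent: a second stop() is a no-op
    }
    state = State.STOPPED;
    try {
      serviceStop();             // subclass hook
    } catch (Exception e) {
      // best-effort cleanup; a real implementation would log this
    }
  }

  protected void serviceStart() throws Exception { }
  protected void serviceStop() throws Exception { }
}
{code}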
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch Thanks, Sandy. Uploaded a new patch that moves the AM resource usage check to AppSchedulable. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2103) Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
[ https://issues.apache.org/jira/browse/YARN-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015275#comment-14015275 ] Hudson commented on YARN-2103: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5642 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5642/]) YARN-2103. Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder (Contributed by Binglin Chang) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599115) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder - Key: YARN-2103 URL: https://issues.apache.org/jira/browse/YARN-2103 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.5.0 Attachments: YARN-2103.v1.patch, YARN-2103.v2.patch, YARN-2103.v3.patch Bug 1: {code} SerializedExceptionProto proto = SerializedExceptionProto .getDefaultInstance(); SerializedExceptionProto.Builder builder = null; boolean viaProto = false; {code} Since viaProto is false, we should initialize builder rather than proto Bug 2: the class does not provide hashCode() and equals() like other PBImpl records; since this class is used in other records, it may affect other records' behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
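Editor's note: for readers unfamiliar with the PBImpl convention the report leans on, exactly one of {{proto}}/{{builder}} is the source of truth at any time, selected by {{viaProto}}. A self-contained model of that invariant and of the fix for Bug 1 (stand-in types, not the real protobuf-generated classes):
{code}
public class ViaProtoSketch {
  // Hypothetical stand-ins for a generated message and its builder.
  static class Proto { final String msg; Proto(String m) { msg = m; } }
  static class Builder {
    String msg = "";
    Builder setMsg(String m) { msg = m; return this; }
    Proto build() { return new Proto(msg); }
  }

  private Proto proto;      // authoritative iff viaProto == true
  private Builder builder;  // authoritative iff viaProto == false
  private boolean viaProto;

  ViaProtoSketch() {
    // Consistent initial state: viaProto == false, so the builder (not the
    // proto) must be live. The reported bug initialized proto instead.
    builder = new Builder();
    viaProto = false;
  }

  Proto getProto() {
    if (!viaProto) {
      proto = builder.build();
      viaProto = true;      // proto becomes the source of truth
    }
    return proto;
  }
}
{code}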
[jira] [Commented] (YARN-741) Mark yarn.service package as public unstable
[ https://issues.apache.org/jira/browse/YARN-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015298#comment-14015298 ] Steve Loughran commented on YARN-741: - fixed in YARN-825 Mark yarn.service package as public unstable Key: YARN-741 URL: https://issues.apache.org/jira/browse/YARN-741 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.0.4-alpha Reporter: Steve Loughran The package info file {{/org/apache/hadoop/yarn/service/package-info.java}} marks the package as private - yet it's something all YARN apps need to use (by way of {{YarnClientImpl}}), and it's something all YARN AMs and containers should be building from. Once we are happy with the API and the documentation, mark it as public, leaving it unstable until we have been using it enough to be confident that it is -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-741) Mark yarn.service package as public unstable
[ https://issues.apache.org/jira/browse/YARN-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved YARN-741. - Resolution: Duplicate Mark yarn.service package as public unstable Key: YARN-741 URL: https://issues.apache.org/jira/browse/YARN-741 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.0.4-alpha Reporter: Steve Loughran The package info file {{/org/apache/hadoop/yarn/service/package-info.java}} marks the package as private - yet it's something all YARN apps need to use (by way of {{YarnClientImpl}}), and it's something all YARN AMs and containers should be building from. Once we are happy with the API and the documentation, mark it as public, leaving it unstable until we have been using it enough to be confident that it is -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015300#comment-14015300 ] Hadoop QA commented on YARN-1913: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647873/YARN-1913.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3884//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3884//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3884//console This message is automatically generated. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2103) Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
[ https://issues.apache.org/jira/browse/YARN-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015334#comment-14015334 ] Hudson commented on YARN-2103: -- FAILURE: Integrated in Hadoop-Yarn-trunk #571 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/571/]) YARN-2103. Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder (Contributed by Binglin Chang) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599115) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder - Key: YARN-2103 URL: https://issues.apache.org/jira/browse/YARN-2103 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.5.0 Attachments: YARN-2103.v1.patch, YARN-2103.v2.patch, YARN-2103.v3.patch Bug 1: {code} SerializedExceptionProto proto = SerializedExceptionProto .getDefaultInstance(); SerializedExceptionProto.Builder builder = null; boolean viaProto = false; {code} Since viaProto is false, we should initialize builder rather than proto Bug 2: the class does not provide hashCode() and equals() like other PBImpl records; since this class is used in other records, it may affect other records' behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2117) Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block
[ https://issues.apache.org/jira/browse/YARN-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2117: -- Attachment: YARN-2117.patch Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block --- Key: YARN-2117 URL: https://issues.apache.org/jira/browse/YARN-2117 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Assignee: Chen He Priority: Minor Labels: newbie Attachments: YARN-2117.patch Here is related code: {code} Reader reader = new FileReader(signatureSecretFile); int c = reader.read(); while (c > -1) { secret.append((char) c); c = reader.read(); } reader.close(); {code} If IOException is thrown out of reader.read(), reader would be left unclosed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2117) Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block
[ https://issues.apache.org/jira/browse/YARN-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2117: -- Attachment: YARN-2117.patch Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block --- Key: YARN-2117 URL: https://issues.apache.org/jira/browse/YARN-2117 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Assignee: Chen He Priority: Minor Labels: newbie Attachments: YARN-2117.patch Here is related code: {code} Reader reader = new FileReader(signatureSecretFile); int c = reader.read(); while (c > -1) { secret.append((char) c); c = reader.read(); } reader.close(); {code} If IOException is thrown out of reader.read(), reader would be left unclosed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2117) Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block
[ https://issues.apache.org/jira/browse/YARN-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2117: -- Attachment: (was: YARN-2117.patch) Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block --- Key: YARN-2117 URL: https://issues.apache.org/jira/browse/YARN-2117 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Assignee: Chen He Priority: Minor Labels: newbie Attachments: YARN-2117.patch Here is related code: {code} Reader reader = new FileReader(signatureSecretFile); int c = reader.read(); while (c > -1) { secret.append((char) c); c = reader.read(); } reader.close(); {code} If IOException is thrown out of reader.read(), reader would be left unclosed. -- This message was sent by Atlassian JIRA (v6.2#6252)
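Editor's note: a minimal sketch of the fix the issue asks for, guaranteeing the reader is closed even when {{read()}} throws. On Java 7+ try-with-resources would be the idiomatic form; the finally block below works on Java 6 as well:
{code}
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class ReadSecretSketch {
  static String readSecret(String signatureSecretFile) throws IOException {
    StringBuilder secret = new StringBuilder();
    Reader reader = new FileReader(signatureSecretFile);
    try {
      int c = reader.read();
      while (c > -1) {
        secret.append((char) c);
        c = reader.read();
      }
    } finally {
      reader.close();   // runs even if read() throws, so no leaked handle
    }
    return secret.toString();
  }
}
{code}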
[jira] [Commented] (YARN-2075) TestRMAdminCLI consistently fails on trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015401#comment-14015401 ] Mit Desai commented on YARN-2075: - [~ozawa] and [~kj-ki], that was my bad. My local repo might not have been updated when I tested. I tested the patch and it works fine for me too. Patch looks good to me. +1 (non-binding) TestRMAdminCLI consistently fails on trunk and branch-2 -- Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen Assignee: Kenji Kikushima Attachments: YARN-2075.patch {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file
[ https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1874: - Attachment: YARN-1874.4.patch Cleanup: Move RMActiveServices out of ResourceManager into its own file --- Key: YARN-1874 URL: https://issues.apache.org/jira/browse/YARN-1874 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1874.1.patch, YARN-1874.2.patch, YARN-1874.3.patch, YARN-1874.4.patch As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We should move RMActiveServices out to make it more manageable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file
[ https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1874: - Attachment: (was: YARN-1874.4.patch) Cleanup: Move RMActiveServices out of ResourceManager into its own file --- Key: YARN-1874 URL: https://issues.apache.org/jira/browse/YARN-1874 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1874.1.patch, YARN-1874.2.patch, YARN-1874.3.patch, YARN-1874.4.patch As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We should move RMActiveServices out to make it more manageable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2103) Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
[ https://issues.apache.org/jira/browse/YARN-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015407#comment-14015407 ] Hudson commented on YARN-2103: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1762 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1762/]) YARN-2103. Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder (Contributed by Binglin Chang) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599115) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder - Key: YARN-2103 URL: https://issues.apache.org/jira/browse/YARN-2103 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.5.0 Attachments: YARN-2103.v1.patch, YARN-2103.v2.patch, YARN-2103.v3.patch Bug 1: {code} SerializedExceptionProto proto = SerializedExceptionProto .getDefaultInstance(); SerializedExceptionProto.Builder builder = null; boolean viaProto = false; {code} Since viaProto is false, we should initialize builder rather than proto Bug 2: the class does not provide hashCode() and equals() like other PBImpl records; since this class is used in other records, it may affect other records' behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2103) Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
[ https://issues.apache.org/jira/browse/YARN-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015414#comment-14015414 ] Tsuyoshi OZAWA commented on YARN-2103: -- Thanks and good job, [~decster], and thank you for the review and commit, [~djp]! Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder - Key: YARN-2103 URL: https://issues.apache.org/jira/browse/YARN-2103 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.5.0 Attachments: YARN-2103.v1.patch, YARN-2103.v2.patch, YARN-2103.v3.patch Bug 1: {code} SerializedExceptionProto proto = SerializedExceptionProto .getDefaultInstance(); SerializedExceptionProto.Builder builder = null; boolean viaProto = false; {code} Since viaProto is false, we should initialize builder rather than proto Bug 2: the class does not provide hashCode() and equals() like other PBImpl records; since this class is used in other records, it may affect other records' behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2075) TestRMAdminCLI consistently fails on trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015421#comment-14015421 ] Tsuyoshi OZAWA commented on YARN-2075: -- [~mitdesai], Thanks for reporting. +1 (non-binding), too. [~zjshen], could you take a look, please? TestRMAdminCLI consistently fails on trunk and branch-2 -- Key: YARN-2075 URL: https://issues.apache.org/jira/browse/YARN-2075 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen Assignee: Kenji Kikushima Attachments: YARN-2075.patch {code} Running org.apache.hadoop.yarn.client.TestRMAdminCLI Tests run: 13, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 1.191 sec FAILURE! - in org.apache.hadoop.yarn.client.TestRMAdminCLI testTransitionToActive(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.082 sec ERROR! java.lang.UnsupportedOperationException: null at java.util.AbstractList.remove(AbstractList.java:144) at java.util.AbstractList$Itr.remove(AbstractList.java:360) at java.util.AbstractCollection.remove(AbstractCollection.java:252) at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173) at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180) testHelp(org.apache.hadoop.yarn.client.TestRMAdminCLI) Time elapsed: 0.088 sec FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366) at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2117) Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block
[ https://issues.apache.org/jira/browse/YARN-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015429#comment-14015429 ] Hadoop QA commented on YARN-2117: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647900/YARN-2117.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3885//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3885//console This message is automatically generated. Close of Reader in TimelineAuthenticationFilterInitializer#initFilter() should be enclosed in finally block --- Key: YARN-2117 URL: https://issues.apache.org/jira/browse/YARN-2117 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Assignee: Chen He Priority: Minor Labels: newbie Attachments: YARN-2117.patch Here is related code: {code} Reader reader = new FileReader(signatureSecretFile); int c = reader.read(); while (c > -1) { secret.append((char) c); c = reader.read(); } reader.close(); {code} If IOException is thrown out of reader.read(), reader would be left unclosed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1550: -- Assignee: Anubhav Dhoot Looks good to me. +1. Committing this shortly. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.patch Three steps: 1、debug at RMAppManager#submitApplication after code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2、submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3、go to page http://ip:50030/cluster/scheduler and find 500 ERROR! the log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015439#comment-14015439 ] Karthik Kambatla commented on YARN-1550: Actually, I ran into the following NPE when running the new test locally. [~adhoot] - can you please take a look? It might be due to other changes that landed in the interim. {noformat} java.lang.NullPointerException: null at org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.ClusterMetricsInfo.<init>(ClusterMetricsInfo.java:65) at org.apache.hadoop.yarn.server.resourcemanager.webapp.MetricsOverviewTable.render(MetricsOverviewTable.java:58) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {noformat} NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.patch Three steps: 1、debug at RMAppManager#submitApplication after code if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } 2、submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3、go to page http://ip:50030/cluster/scheduler and find 500 ERROR! the log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015442#comment-14015442 ] Hadoop QA commented on YARN-1913: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647905/YARN-1913.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3887//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3887//console This message is automatically generated. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file
[ https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015453#comment-14015453 ] Hadoop QA commented on YARN-1874: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647902/YARN-1874.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 20 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3886//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3886//console This message is automatically generated. Cleanup: Move RMActiveServices out of ResourceManager into its own file --- Key: YARN-1874 URL: https://issues.apache.org/jira/browse/YARN-1874 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1874.1.patch, YARN-1874.2.patch, YARN-1874.3.patch, YARN-1874.4.patch As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We should move RMActiveServices out to make it more manageable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2103) Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder
[ https://issues.apache.org/jira/browse/YARN-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015456#comment-14015456 ] Hudson commented on YARN-2103: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1789 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1789/]) YARN-2103. Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder (Contributed by Binglin Chang) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599115) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java Inconsistency between viaProto flag and initial value of SerializedExceptionProto.Builder - Key: YARN-2103 URL: https://issues.apache.org/jira/browse/YARN-2103 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.5.0 Attachments: YARN-2103.v1.patch, YARN-2103.v2.patch, YARN-2103.v3.patch Bug 1: {code} SerializedExceptionProto proto = SerializedExceptionProto .getDefaultInstance(); SerializedExceptionProto.Builder builder = null; boolean viaProto = false; {code} Since viaProto is false, we should initialize builder rather than proto Bug 2: the class does not provide hashCode() and equals() like other PBImpl records; since this class is used in other records, it may affect other records' behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015518#comment-14015518 ] Sandy Ryza commented on YARN-1913: -- This is looking good. A few small things. AppSchedulingInfo is only used to track pending resources. We should hold amResource in SchedulerApplicationAttempt. {code} + if (! queue.canRunAppAM(app.getAMResource())) { {code} Take out the space after the exclamation point. {code} @Override + public boolean checkIfAMResourceUsageOverLimit(Resource usage, Resource maxAMResource) { +return Resources.greaterThan(RESOURCE_CALCULATOR, null, usage, maxAMResource); + } {code} Simpler to just use usage.getMemory() > maxAMResource.getMemory(). {code} + if (request.getPriority().equals(RMAppAttemptImpl.AM_CONTAINER_PRIORITY)) { {code} I'm a little nervous about using the priority here because apps could unwittingly submit all requests at that priority. Can we use SchedulerApplicationAttempt.getLiveContainers().isEmpty()? With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015536#comment-14015536 ] Wei Yan commented on YARN-1913: --- Thanks, Sandy. One problem may exist if we use SchedulerApplicationAttempt.getLiveContainers().isEmpty(): if the application is an unmanaged AM, it will not generate an AM resource request. Thus, the first request would be an actual task, not an AM. Correct me if I'm wrong here. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
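Editor's note: the unmanaged-AM caveat can be folded into the check itself. A hedged, self-contained sketch of that shape (all types below are simplified stand-ins, not the real FairScheduler or SchedulerApplicationAttempt API):
{code}
import java.util.List;

public class AmShareCheckSketch {
  // Hypothetical slice of the attempt's state needed for the check.
  interface Attempt {
    boolean isUnmanagedAM();
    List<Object> getLiveContainers();
    int getAMResourceMb();
  }

  /** True when this attempt's next container would be its AM container. */
  static boolean looksLikeAmRequest(Attempt app) {
    // Unmanaged AMs never ask the scheduler for an AM container, so their
    // first request is a real task - the concern raised in the comment above.
    return !app.isUnmanagedAM() && app.getLiveContainers().isEmpty();
  }

  /** Gate AM scheduling on the queue's AM-share budget (memory-only, per
      Sandy's simplification of the Resource comparison). */
  static boolean mayScheduleAm(Attempt app, int amUsedMb, int maxAmShareMb) {
    return amUsedMb + app.getAMResourceMb() <= maxAmShareMb;
  }
}
{code}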
[jira] [Commented] (YARN-1874) Cleanup: Move RMActiveServices out of ResourceManager into its own file
[ https://issues.apache.org/jira/browse/YARN-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015588#comment-14015588 ] Tsuyoshi OZAWA commented on YARN-1874: -- It's ready for review. This patch includes the following changes: 1. Moved RMActiveServices out of ResourceManager into its own file. 2. Added {{getRMAppManager}}, {{getQueueACLsManager}}, {{getApplicationACLsManager}} to RMContext. 3. Changed tests to override {{ResourceManager#createAndInitActiveServices}} method. Cleanup: Move RMActiveServices out of ResourceManager into its own file --- Key: YARN-1874 URL: https://issues.apache.org/jira/browse/YARN-1874 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Karthik Kambatla Assignee: Tsuyoshi OZAWA Attachments: YARN-1874.1.patch, YARN-1874.2.patch, YARN-1874.3.patch, YARN-1874.4.patch As [~vinodkv] noticed on YARN-1867, ResourceManager is hard to maintain. We should move RMActiveServices out to make it more manageable. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015611#comment-14015611 ] Bikas Saha commented on YARN-2091: -- Can this miss a case where the exitCode has not been set (e.g. when the container crashes on its own)? Should we check if the exitCode has already been set (e.g. via a kill event) and, if it's not set, then set it from exitEvent? How can we check if the exitCode has not been set? Maybe have some uninitialized/invalid default value. {code}@@ -829,7 +829,6 @@ public void transition(ContainerImpl container, ContainerEvent event) { @Override public void transition(ContainerImpl container, ContainerEvent event) { ContainerExitEvent exitEvent = (ContainerExitEvent) event; - container.exitCode = exitEvent.getExitCode();{code} The new exit status codes need better comments/docs. E.g. what is the difference between the 2 new AppMaster-related exit statuses? Is kill_by_resourcemanager a generic value that can be replaced later on by a more specific reason like preempted? Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. The jira tracks adding this exit status and passing it from the NM to the RM and then the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
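Editor's note: a tiny sketch of the "uninitialized/invalid default value" suggestion (the constant name and value are assumptions, not the actual ContainerImpl code):
{code}
public class ExitCodeSketch {
  // Sentinel chosen outside the range of real container exit codes,
  // so "already set" becomes a checkable condition.
  static final int INVALID_EXIT_CODE = Integer.MIN_VALUE;

  private int exitCode = INVALID_EXIT_CODE;

  void onExitEvent(int eventExitCode) {
    // Only take the event's code if nothing else (e.g. a kill handler
    // recording KILL_EXCEEDED_MEMORY) set one first.
    if (exitCode == INVALID_EXIT_CODE) {
      exitCode = eventExitCode;
    }
  }

  boolean exitCodeWasSet() {
    return exitCode != INVALID_EXIT_CODE;
  }
}
{code}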
[jira] [Created] (YARN-2119) Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590
Anubhav Dhoot created YARN-2119: --- Summary: Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590 Key: YARN-2119 URL: https://issues.apache.org/jira/browse/YARN-2119 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because all the users of the method (only 1 user) ignore the port, it's not breaking anything yet. Fixing it in case someone else uses this in the future. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
Siqi Li created YARN-2120: - Summary: Coloring queues running over minShare on RM Scheduler page Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015653#comment-14015653 ] Karthik Kambatla commented on YARN-2010: Sorry, the commit messages are for the wrong JIRA. Will fix them up. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
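Editor's note: the resilience being asked for amounts to failing one application's recovery without aborting the whole transition to active. A hedged sketch (all names hypothetical, not the real RMAppManager code):
{code}
import java.util.List;
import java.util.logging.Logger;

public class ResilientRecoverySketch {
  private static final Logger LOG = Logger.getLogger("recovery");

  interface AppState { String appId(); }
  interface Recoverer { void recover(AppState s) throws Exception; }

  static void recoverAll(List<AppState> states, Recoverer r) {
    for (AppState s : states) {
      try {
        r.recover(s);
      } catch (Exception e) {
        // One unrecoverable app (e.g. submitted before Kerberos was turned
        // on) should not keep the RM down: mark the app failed, keep
        // recovering the rest, and let the RM transition to active.
        LOG.warning("Failed to recover " + s.appId()
            + ", marking it failed: " + e);
      }
    }
  }
}
{code}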
[jira] [Updated] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2120: -- Description: Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running at more than its min share. Also, we can display a queue running at maxShare in red. Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running at more than its min share. Also, we can display a queue running at maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2120: -- Attachment: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Attachments: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running at more than its min share. Also, we can display a queue running at maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2119) Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590
[ https://issues.apache.org/jira/browse/YARN-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2119: Attachment: YARN-2119.patch Fix with unit tests. Ran the org.apache.hadoop.yarn.server.webproxy.TestWebAppProxyServer tests. Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590 - Key: YARN-2119 URL: https://issues.apache.org/jira/browse/YARN-2119 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2119.patch The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because all the users of the method (only 1 user) ignore the port, it's not breaking anything yet. Fixing it in case someone else uses this in the future. -- This message was sent by Atlassian JIRA (v6.2#6252)
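Editor's note: the bug class described here is easy to state in code - a default "host:port" string whose port disagrees with the default-port constant that callers fall back to. The constant values below are illustrative assumptions, not the real YarnConfiguration values:
{code}
public class DefaultPortSketch {
  static final String DEFAULT_PROXY_HOST = "0.0.0.0";
  static final int DEFAULT_PROXY_PORT = 9099;               // assumed value

  // Bug pattern: the default address is built with one port hard-coded...
  static final String BUGGY_DEFAULT_PROXY_ADDRESS =
      DEFAULT_PROXY_HOST + ":8089";

  // ...while the fallback port constant says another. Callers that ignore
  // the port never notice; the first caller that uses it gets a surprise.
  // The fix is to derive one from the other so they can never drift apart:
  static final String FIXED_DEFAULT_PROXY_ADDRESS =
      DEFAULT_PROXY_HOST + ":" + DEFAULT_PROXY_PORT;
}
{code}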
[jira] [Resolved] (YARN-2108) Show minShare on RM Fair Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li resolved YARN-2108. --- Resolution: Duplicate Show minShare on RM Fair Scheduler page --- Key: YARN-2108 URL: https://issues.apache.org/jira/browse/YARN-2108 Project: Hadoop YARN Issue Type: Task Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2108.v1.patch, YARN-2108.v2.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. It would be better to show MinShare with a possibly different color code, so that we know when a queue is running at more than its min share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (YARN-2108) Show minShare on RM Fair Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li reopened YARN-2108: --- Show minShare on RM Fair Scheduler page --- Key: YARN-2108 URL: https://issues.apache.org/jira/browse/YARN-2108 Project: Hadoop YARN Issue Type: Task Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2108.v1.patch, YARN-2108.v2.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. It would be better to show MinShare with a possibly different color code, so that we know when a queue is running at more than its min share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015732#comment-14015732 ] Siqi Li commented on YARN-2120: --- Attached a screenshot of the proposed coloring scheme. Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Attachments: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running over its min share. Also, we can display a queue running at maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2108) Show minShare on RM Fair Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li resolved YARN-2108. --- Resolution: Duplicate Show minShare on RM Fair Scheduler page --- Key: YARN-2108 URL: https://issues.apache.org/jira/browse/YARN-2108 Project: Hadoop YARN Issue Type: Task Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2108.v1.patch, YARN-2108.v2.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. It would be better to also show MinShare, possibly with a different color code, so that we know when a queue is running over its min share. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1550: Attachment: YARN-1550.003.patch Fixed test failures after resolving conflicts with some interim changes that had been checked in. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.003.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication, after this code: {code} if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } {code} 2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2121) TimelineAuthenticator#hasDelegationToken may throw NPE
Zhijie Shen created YARN-2121: - Summary: TimelineAuthenticator#hasDelegationToken may throw NPE Key: YARN-2121 URL: https://issues.apache.org/jira/browse/YARN-2121 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen {code} private boolean hasDelegationToken(URL url) { return url.getQuery().contains( TimelineAuthenticationConsts.DELEGATION_PARAM + "="); } {code} If the given URL doesn't have any params at all, url.getQuery() returns null and this will throw an NPE. -- This message was sent by Atlassian JIRA (v6.2#6252)
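A null-safe version would simply guard the query string before matching; the following is an editorial sketch of the obvious fix, not necessarily what the eventual YARN-2121 patch does:
{code}
private boolean hasDelegationToken(URL url) {
  // URL#getQuery() returns null when the URL carries no query string,
  // so check for null before calling contains().
  String query = url.getQuery();
  return query != null
      && query.contains(TimelineAuthenticationConsts.DELEGATION_PARAM + "=");
}
{code}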
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch Updated the patch. It uses getLiveContainers().size() and unManagedAM to detect the AM container. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
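The detection Wei describes presumably reduces to a check like the following at allocation time (the accessor and helper names here are assumptions for illustration, not quotes from the attached patch):
{code}
// A managed AM's first allocated container is the AM container; an
// unmanaged AM never receives one, so it is exempt from the AM-share cap.
boolean allocatingAmContainer =
    !app.isUnManagedAM() && app.getLiveContainers().isEmpty();
if (allocatingAmContainer && wouldExceedMaxAMShare(queue, request)) {
  // Delay this allocation instead of letting AMs consume the whole cluster.
}
{code}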
[jira] [Commented] (YARN-2119) Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590
[ https://issues.apache.org/jira/browse/YARN-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015763#comment-14015763 ] Hadoop QA commented on YARN-2119: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647959/YARN-2119.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3888//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3888//console This message is automatically generated. Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590 - Key: YARN-2119 URL: https://issues.apache.org/jira/browse/YARN-2119 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2119.patch The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because the method's only user ignores the port, it's not breaking anything yet. Fixing it in case someone else uses it in the future. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015776#comment-14015776 ] Vinod Kumar Vavilapalli commented on YARN-2010: --- bq. It is true that the first time we encountered this was during an upgrade from non-secure to secure cluster. My point is that this is a non-supported use-case. Let's make that explicit by throwing an appropriate exception with the right message (1). bq. However, as I mentioned earlier in the JIRA, it is possible to run into this in other situations. Let's figure out what these situations are and make sure they are handled correctly (2). Skipping apps in all cases is likely not the right solution. bq. Even in the case of upgrading from non-secure to secure cluster, I totally understand we can't support recovering running/completed applications. However, one shouldn't have to explicitly nuke the ZK store (which, by the way, is involved due to the ACLs magic and lacks an rmadmin command) to be able to start the RM. On the other hand, coupled with (1) above, that is exactly what I'd expect. If we skip applications automatically in all cases, that may be a worse thing to happen: suddenly users will see that they are losing apps for a reason that is not obvious to them. The risk of crashing the RM is the need for manual intervention and a longer downtime, but with (2) above that risk will be mitigated a lot. Even if we decide to skip the apps, the outcome is the same (losing the apps), but it had better be a conscious decision by the admins. The crux of my argument is: let's not do a blanket {code} try { ... } catch (Exception e) { continue; } {code} Instead do {code} try { ... } catch (ExceptionType1 e) { // handle correctly } catch (ExceptionType2 e) { // handle correctly } ... catch (Exception catchAll) { // Decide to skip the app or crash the RM. } {code} RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. It makes sense for the app to fail in this case, but YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at
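A fleshed-out sketch of the typed-catch recovery loop being argued for above, with hypothetical handler names (failApplication here is illustrative; the concrete handling is what the attached patches negotiate):
{code}
for (ApplicationState appState : state.getApplicationState().values()) {
  try {
    recoverApplication(appState);
  } catch (SecretManager.InvalidToken e) {
    // Known, explicitly unsupported case, e.g. an app submitted before
    // security was enabled: fail this one app with a clear message.
    failApplication(appState, "App predates security being enabled", e);
  } catch (Exception e) {
    // Anything unrecognized: crash the RM so that skipping apps stays a
    // conscious admin decision rather than a silent default.
    throw new YarnRuntimeException("Failed to recover " + appState, e);
  }
}
{code}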
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015798#comment-14015798 ] Hadoop QA commented on YARN-1550: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647960/YARN-1550.003.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3889//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3889//console This message is automatically generated. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.003.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication, after this code: {code} if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } {code} 2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch Thanks, Sandy. Fixed that problem. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015823#comment-14015823 ] Hadoop QA commented on YARN-1913: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647968/YARN-1913.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestMaxRunningAppsEnforcer org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerApplicationAttempt org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSSchedulerApp {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3890//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3890//console This message is automatically generated. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1913: -- Attachment: YARN-1913.patch New patch to fix the test errors. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015844#comment-14015844 ] Karthik Kambatla commented on YARN-1550: Thanks Anubhav. +1. Committing this shortly. NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.2.1 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.003.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication, after this code: {code} if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } {code} 2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1590) _HOST doesn't expand properly for RM, NM, ProxyServer and JHS
[ https://issues.apache.org/jira/browse/YARN-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015862#comment-14015862 ] Karthik Kambatla commented on YARN-1590: Just ran into this. If I am not mistaken, in the following snippet we intended to use DEFAULT_PROXY_PORT instead of DEFAULT_RM_PORT. Correct? {code} ... PROXY_PREFIX + "address"; + public static final int DEFAULT_PROXY_PORT = 9099; + public static final String DEFAULT_PROXY_ADDRESS = + "0.0.0.0:" + DEFAULT_RM_PORT; {code} YARN-2119 has been filed to fix this. _HOST doesn't expand properly for RM, NM, ProxyServer and JHS - Key: YARN-1590 URL: https://issues.apache.org/jira/browse/YARN-1590 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.2.0 Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Fix For: 2.4.0 Attachments: YARN-1590.1.patch, YARN-1590.2.patch, YARN-1590.3.patch, YARN-1590.4.patch _HOST is not properly substituted when we use a VIP address. Currently it always uses the host name of the machine and disregards the VIP address. This is true mainly for the RM, NM, WebProxy, and JHS RPC services. It looks like it is working fine for webservice authentication. On the other hand, the same thing works fine for the NN and SNN, in RPC as well as webservice. -- This message was sent by Atlassian JIRA (v6.2#6252)
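Spelled out, the corrected constant would presumably read as follows (a one-line sketch of the YARN-2119 change, not a quote of the committed patch):
{code}
public static final String DEFAULT_PROXY_ADDRESS =
    "0.0.0.0:" + DEFAULT_PROXY_PORT;
{code}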
[jira] [Commented] (YARN-2119) Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590
[ https://issues.apache.org/jira/browse/YARN-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015866#comment-14015866 ] Karthik Kambatla commented on YARN-2119: Looks good to me. +1. I'll commit this in a day if no one else has any comments. Fix the DEFAULT_PROXY_ADDRESS used for getBindAddress to fix 1590 - Key: YARN-2119 URL: https://issues.apache.org/jira/browse/YARN-2119 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2119.patch The fix for [YARN-1590|https://issues.apache.org/jira/browse/YARN-1590] introduced a method to get the web proxy bind address with an incorrect default port. Because the method's only user ignores the port, it's not breaking anything yet. Fixing it in case someone else uses it in the future. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1877) Document yarn.resourcemanager.zk-auth and its scope
[ https://issues.apache.org/jira/browse/YARN-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015868#comment-14015868 ] Hudson commented on YARN-1877: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5643 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5643/]) YARN-1877. Updated CHANGES.txt to fix the JIRA number. It was previously committed as YARN-2010. (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599348) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Document yarn.resourcemanager.zk-auth and its scope --- Key: YARN-1877 URL: https://issues.apache.org/jira/browse/YARN-1877 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Priority: Critical Fix For: 2.5.0 Attachments: YARN-1877.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1550) NPE in FairSchedulerAppsBlock#render
[ https://issues.apache.org/jira/browse/YARN-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015870#comment-14015870 ] Hudson commented on YARN-1550: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5643 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5643/]) YARN-1550. NPE in FairSchedulerAppsBlock#render. (Anubhav Dhoot via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599345) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/FairSchedulerAppsBlock.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebAppFairScheduler.java NPE in FairSchedulerAppsBlock#render Key: YARN-1550 URL: https://issues.apache.org/jira/browse/YARN-1550 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Reporter: caolong Assignee: Anubhav Dhoot Priority: Critical Fix For: 2.5.0 Attachments: YARN-1550.001.patch, YARN-1550.002.patch, YARN-1550.003.patch, YARN-1550.patch Three steps: 1. Debug at RMAppManager#submitApplication, after this code: {code} if (rmContext.getRMApps().putIfAbsent(applicationId, application) != null) { String message = "Application with id " + applicationId + " is already present! Cannot add a duplicate!"; LOG.warn(message); throw RPCUtil.getRemoteException(message); } {code} 2. Submit one application: hadoop jar ~/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-ydh2.2.0-tests.jar sleep -Dhadoop.job.ugi=test2,#11 -Dmapreduce.job.queuename=p1 -m 1 -mt 1 -r 1 3. Go to the page http://ip:50030/cluster/scheduler and find a 500 ERROR! The log: {noformat} 2013-12-30 11:51:43,795 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.webapp.FairSchedulerAppsBlock.render(FairSchedulerAppsBlock.java:96) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015871#comment-14015871 ] Hudson commented on YARN-2010: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5643 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5643/]) YARN-1877. Updated CHANGES.txt to fix the JIRA number. It was previously committed as YARN-2010. (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599348) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Rohith Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. It makes sense for the app to fail in this case, but YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1540) Add an easy way to turn on HA
[ https://issues.apache.org/jira/browse/YARN-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-1540. Resolution: Invalid Target Version/s: (was: ) Other sub-tasks under YARN-149 handle this, and it is now relatively easy to configure RM HA. This JIRA is no longer valid. Please re-open it or file another JIRA if you see other possible improvements. Add an easy way to turn on HA - Key: YARN-1540 URL: https://issues.apache.org/jira/browse/YARN-1540 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Users will have to modify the configuration significantly to turn on HA. It would be nice to have a simpler way of doing this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015879#comment-14015879 ] Hadoop QA commented on YARN-1913: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647969/YARN-1913.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestMaxRunningAppsEnforcer org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSSchedulerApp org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerApplicationAttempt {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3891//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3891//console This message is automatically generated. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2122) In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
Karthik Kambatla created YARN-2122: -- Summary: In AllocationFileLoaderService, the reloadThread should be created in init() and started in start() Key: YARN-2122 URL: https://issues.apache.org/jira/browse/YARN-2122 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter AllocationFileLoaderService has a reloadThread that is currently both created and started in start(). Instead, it should be created in init() and started in start(). -- This message was sent by Atlassian JIRA (v6.2#6252)
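A minimal sketch of the intended split, using the serviceInit()/serviceStart()/serviceStop() hooks that AbstractService exposes (the thread body and field handling are illustrative, not the attached patch):
{code}
@Override
protected void serviceInit(Configuration conf) throws Exception {
  // Create, but do not start, the reload thread while initializing.
  reloadThread = new Thread(new Runnable() {
    @Override
    public void run() {
      // Poll the allocation file and reload it when it changes
      // (reload loop elided; see AllocationFileLoaderService).
    }
  });
  reloadThread.setName("AllocationFileReloader");
  super.serviceInit(conf);
}

@Override
protected void serviceStart() throws Exception {
  // Only start the already-created thread when the service starts.
  reloadThread.start();
  super.serviceStart();
}

@Override
protected void serviceStop() throws Exception {
  // Interrupt and join so stop() cannot leak a running reload thread.
  if (reloadThread != null) {
    reloadThread.interrupt();
    reloadThread.join();
  }
  super.serviceStop();
}
{code}
Creating the thread in serviceInit() keeps start() free of construction work and matches the service lifecycle: init() prepares resources, start() activates them, stop() releases them.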
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015896#comment-14015896 ] Hadoop QA commented on YARN-1913: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647976/YARN-1913.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3892//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3892//console This message is automatically generated. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2120: -- Attachment: YARN-2120.v1.patch Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Attachments: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png, YARN-2120.v1.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running over its min share. Also, we can display a queue running at maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2121) TimelineAuthenticator#hasDelegationToken may throw NPE
[ https://issues.apache.org/jira/browse/YARN-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2121: -- Attachment: YARN-2121.1.patch Uploaded a patch to fix the problem and added the corresponding test cases. TimelineAuthenticator#hasDelegationToken may throw NPE -- Key: YARN-2121 URL: https://issues.apache.org/jira/browse/YARN-2121 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2121.1.patch {code} private boolean hasDelegationToken(URL url) { return url.getQuery().contains( TimelineAuthenticationConsts.DELEGATION_PARAM + "="); } {code} If the given URL doesn't have any params at all, url.getQuery() returns null and this will throw an NPE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2122) In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
[ https://issues.apache.org/jira/browse/YARN-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2122: Attachment: YARN-2122.patch In AllocationFileLoaderService, the reloadThread should be created in init() and started in start() --- Key: YARN-2122 URL: https://issues.apache.org/jira/browse/YARN-2122 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2122.patch AllocationFileLoaderService has a reloadThread that is currently both created and started in start(). Instead, it should be created in init() and started in start(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015980#comment-14015980 ] Hadoop QA commented on YARN-2120: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648006/YARN-2120.v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3893//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3893//console This message is automatically generated. Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png, YARN-2120.v1.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running over its min share. Also, we can display a queue running at maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2122) In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
[ https://issues.apache.org/jira/browse/YARN-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14015986#comment-14015986 ] Tsuyoshi OZAWA commented on YARN-2122: -- Thank you for taking this JIRA, [~rkanter]. I think your patch fixes the issue itself. I have one comment: how about overriding serviceInit()/serviceStart()/serviceStop() instead of init()/start()/stop()? Should we do this in another JIRA? [~kkambatl], what do you think? In AllocationFileLoaderService, the reloadThread should be created in init() and started in start() --- Key: YARN-2122 URL: https://issues.apache.org/jira/browse/YARN-2122 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2122.patch AllocationFileLoaderService has a reloadThread that is currently both created and started in start(). Instead, it should be created in init() and started in start(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2121) TimelineAuthenticator#hasDelegationToken may throw NPE
[ https://issues.apache.org/jira/browse/YARN-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016010#comment-14016010 ] Hadoop QA commented on YARN-2121: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648019/YARN-2121.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.TestRMAdminCLI {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3894//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3894//console This message is automatically generated. TimelineAuthenticator#hasDelegationToken may throw NPE -- Key: YARN-2121 URL: https://issues.apache.org/jira/browse/YARN-2121 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2121.1.patch {code} private boolean hasDelegationToken(URL url) { return url.getQuery().contains( TimelineAuthenticationConsts.DELEGATION_PARAM + "="); } {code} If the given URL doesn't have any params at all, url.getQuery() returns null and this will throw an NPE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016019#comment-14016019 ] Tsuyoshi OZAWA commented on YARN-2091: -- [~bikassaha], thank you for the comments. {quote} Should we check if the exitCode has already been set (eg. via a kill event) and if its not set then set it from exitEvent? How can we check if the exitCode has not been set? Maybe have some uninitialized/invalid default value. {quote} IIUC, we can distinguish a set value from the default by checking whether exitCode is still ContainerExitStatus.INVALID, since that is the default value of {{exitCode}}. Do you have any comments about this? {code} if (container.exitCode == ContainerExitStatus.INVALID) { container.exitCode = exitEvent.getExitCode(); } {code} About the new exit status, I'll update the comments in the next patch. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch Currently, the AM cannot programmatically determine if the task was killed due to using excessive memory. The NM kills it without passing this information in the container status back to the RM. So the AM cannot take any action here. This JIRA tracks adding this exit status and passing it from the NM to the RM and then to the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
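For context, an AM consuming the proposed status might look like this (a sketch; the constant name follows the JIRA title, and the final name in the API may differ):
{code}
for (ContainerStatus status : allocateResponse.getCompletedContainersStatuses()) {
  if (status.getExitStatus() == ContainerExitStatus.KILL_EXCEEDED_MEMORY) {
    // The NM killed the container for exceeding its memory limit;
    // e.g. resubmit the task with a larger resource request.
  }
}
{code}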
[jira] [Commented] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016027#comment-14016027 ] Ashwin Shankar commented on YARN-2120: -- [~l201514], It would be helpful if we don't remove the color codes for 'above/below fair share', since we don't always set minShare for queues. In your proposal, for cases where we don't set minShare, the usage would start orange and would look the same below and above fair share. I know that there is a dotted line to mark fair share, but it is too faint and I generally need to squint to find it, especially when there are a lot of queues in the cluster. Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png, YARN-2120.v1.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share) and MaxCapacity. Since fairShare is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over minShare in orange, so that we know the queue is running over its min share. Also, we can display a queue running at maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2122) In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
[ https://issues.apache.org/jira/browse/YARN-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016032#comment-14016032 ] Hadoop QA commented on YARN-2122: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648023/YARN-2122.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3895//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3895//console This message is automatically generated. In AllocationFileLoaderService, the reloadThread should be created in init() and started in start() --- Key: YARN-2122 URL: https://issues.apache.org/jira/browse/YARN-2122 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2122.patch AllocationFileLoaderService has a reloadThread that is currently both created and started in start(). Instead, it should be created in init() and started in start(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2122) In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
[ https://issues.apache.org/jira/browse/YARN-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016035#comment-14016035 ] Karthik Kambatla commented on YARN-2122: Good point, [~ozawa]. It would definitely be better to override serviceInit, serviceStart, and serviceStop. In AllocationFileLoaderService, the reloadThread should be created in init() and started in start() --- Key: YARN-2122 URL: https://issues.apache.org/jira/browse/YARN-2122 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2122.patch AllocationFileLoaderService has a reloadThread that is currently both created and started in start(). Instead, it should be created in init() and started in start(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016050#comment-14016050 ] Ashwin Shankar commented on YARN-2026: -- Hi [~sandyr], did you have any comments? Basically, in the above scenario the fair share policy tends to look like FIFO: the users who submitted apps first hog the cluster, although all users have the same fair share. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt Problem 1: While using hierarchical queues in the fair scheduler, there are scenarios where a leaf queue with the least fair share can take the majority of the cluster and starve a sibling parent queue that has a greater weight/fair share, and preemption doesn't kick in to reclaim resources. The root cause seems to be that the fair share of a parent queue is distributed to all its children, irrespective of whether a child is active or inactive (no apps running). Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and its demand is greater than that. When there are many queues under a parent queue (with a high fair share), each child queue's fair share becomes really low. As a result, when only a few of these child queues have apps running, they reach their *tiny* fair share quickly, and preemption doesn't happen even if other (non-sibling) leaf queues are hogging the cluster. This can be solved by dividing a parent queue's fair share only among its active child queues. Here is an example describing the problem and the proposed solution: root.lowPriorityQueue is a leaf queue with weight 2. root.HighPriorityQueue is a parent queue with weight 8. root.HighPriorityQueue has 10 child leaf queues: root.HighPriorityQueue.childQ(1..10). The above config results in root.HighPriorityQueue having an 80% fair share, and each of its ten child queues having an 8% fair share. Preemption would happen only if a child queue's usage falls below 4% (0.5 * 8 = 4). Let's say at the moment no apps are running in any of root.HighPriorityQueue.childQ(1..10), and a few apps are running in root.lowPriorityQueue, which is taking up 95% of the cluster. Up to this point, the behavior of FS is correct. Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% of the cluster. It would get only the 5% available in the cluster, and preemption wouldn't kick in since it is above 4% (half its fair share). This is bad, considering childQ1 is under a high-priority parent queue which has an *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers, we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1 = 5%* This can be solved by distributing a parent's fair share only to active queues. In the example above, since childQ1 is the only active queue under root.HighPriorityQueue, it would get all of its parent's fair share, i.e., 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem 2: Note that a similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2 if childQ2 hogs the cluster.
childQ2 can take up 95% of the cluster, and childQ1 would be stuck at 5% until childQ2 starts relinquishing containers. We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e., 40%, which would ensure childQ1 can get up to 40% of resources through preemption if needed. -- This message was sent by Atlassian JIRA (v6.2#6252)
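To make the proposed fix concrete, here is a minimal, illustrative Java sketch of distributing a parent's fair share only among active child queues. The class and field names are hypothetical, not taken from the YARN-2026 patch.
{code}
import java.util.ArrayList;
import java.util.List;

class QueueSketch {
  String name;
  double weight;
  double fairShare;  // absolute share of the cluster, in [0, 1]
  List<QueueSketch> children = new ArrayList<>();
  int runningApps;

  // A leaf is active if it has running apps; a parent, if any child is.
  boolean isActive() {
    if (children.isEmpty()) {
      return runningApps > 0;
    }
    for (QueueSketch child : children) {
      if (child.isActive()) {
        return true;
      }
    }
    return false;
  }

  // Divide this queue's fair share among *active* children by weight;
  // inactive queues get nothing, so their share flows to active siblings.
  void distributeFairShare() {
    double activeWeightSum = 0;
    for (QueueSketch child : children) {
      if (child.isActive()) {
        activeWeightSum += child.weight;
      }
    }
    for (QueueSketch child : children) {
      child.fairShare = (child.isActive() && activeWeightSum > 0)
          ? fairShare * child.weight / activeWeightSum
          : 0.0;
      child.distributeFairShare();  // recurse down the hierarchy
    }
  }
}
{code}
Under this distribution, childQ1 as the only active child of root.HighPriorityQueue receives the full 80%, so its preemption threshold rises from 4% to 40%.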
[jira] [Commented] (YARN-2120) Coloring queues running over minShare on RM Scheduler page
[ https://issues.apache.org/jira/browse/YARN-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016066#comment-14016066 ] Siqi Li commented on YARN-2120: --- [~ashwinshankar77] thanks for your feedback; let me see if I can find a way to retain the original format Coloring queues running over minShare on RM Scheduler page -- Key: YARN-2120 URL: https://issues.apache.org/jira/browse/YARN-2120 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.3.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: AD45B623-9F14-420B-B1FB-1186E2B5EC4A.png, YARN-2120.v1.patch Today the RM Scheduler page shows FairShare, Used, Used (over fair share), and MaxCapacity. Since fair share is displayed with a dotted line, I think we can stop displaying orange when a queue is over its fair share. It would be better to show a queue running over its minShare in orange, so that we know the queue is using more than its min share. Also, we can display a queue running at its maxShare in red. -- This message was sent by Atlassian JIRA (v6.2#6252)
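The proposal boils down to a small color-selection rule. A hedged sketch, assuming usage, minShare, and maxShare are expressed as fractions of the cluster (names are illustrative, not from the YARN-2120 patch):
{code}
class QueueColorSketch {
  // Pick the bar color for a queue on the scheduler page.
  static String barColor(double used, double minShare, double maxShare) {
    if (used >= maxShare) {
      return "red";     // queue pinned at its max share
    }
    if (used > minShare) {
      return "orange";  // queue running over its min share
    }
    return "green";     // within min share
  }
}
{code}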
[jira] [Commented] (YARN-2122) In AllocationFileLoaderService, the reloadThread should be created in init() and started in start()
[ https://issues.apache.org/jira/browse/YARN-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016075#comment-14016075 ] Hadoop QA commented on YARN-2122: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12648031/YARN-2122.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3896//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3896//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3896//console This message is automatically generated. In AllocationFileLoaderService, the reloadThread should be created in init() and started in start() --- Key: YARN-2122 URL: https://issues.apache.org/jira/browse/YARN-2122 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2122.patch, YARN-2122.patch AllocationFileLoaderService has a reloadThread that is currently created and started in start(). Instead, it should be created in init() and started in start(). -- This message was sent by Atlassian JIRA (v6.2#6252)
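For clarity, here is a minimal sketch of the lifecycle split this JIRA asks for: the thread is constructed in serviceInit() and only started in serviceStart(). It is written against the AbstractService API but simplified relative to the real AllocationFileLoaderService, and the reload loop body is elided.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.service.AbstractService;

public class ReloadServiceSketch extends AbstractService {
  private Thread reloadThread;

  public ReloadServiceSketch() {
    super(ReloadServiceSketch.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // Create (but do not start) the thread while the service is initializing.
    reloadThread = new Thread(this::reloadLoop, "AllocationFileReloader");
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStart() throws Exception {
    reloadThread.start();  // runs only once the service reaches STARTED
    super.serviceStart();
  }

  @Override
  protected void serviceStop() throws Exception {
    if (reloadThread != null) {
      reloadThread.interrupt();
      reloadThread.join();
    }
    super.serviceStop();
  }

  private void reloadLoop() {
    while (!Thread.currentThread().isInterrupted()) {
      // Poll the allocation file for changes (elided), then sleep.
      try {
        Thread.sleep(10000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // restore the flag and exit
      }
    }
  }
}
{code}
Splitting the two steps keeps init() side-effect free, so a service that is initialized but never started does not leak a running thread.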
[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
[ https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016081#comment-14016081 ] Bikas Saha commented on YARN-2091: -- If we are sure that the default value is set in the code to ContainerExitStatus.INVALID, then sounds good. Given that ContainerExitStatus.INVALID == -1000, we have to explicitly initialize with that value, since Java will default the field to 0. Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters --- Key: YARN-2091 URL: https://issues.apache.org/jira/browse/YARN-2091 Project: Hadoop YARN Issue Type: Task Reporter: Bikas Saha Assignee: Tsuyoshi OZAWA Attachments: YARN-2091.1.patch, YARN-2091.2.patch, YARN-2091.3.patch, YARN-2091.4.patch, YARN-2091.5.patch, YARN-2091.6.patch Currently, the AM cannot programmatically determine if a task was killed for using excessive memory. The NM kills it without passing this information in the container status back to the RM, so the AM cannot take any action here. This JIRA tracks adding this exit status and passing it from the NM to the RM and then to the AM. In general, there may be other such actions taken by YARN that are currently opaque to the AM. -- This message was sent by Atlassian JIRA (v6.2#6252)
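Bikas's point is easy to miss: a Java int field defaults to 0, which collides with ContainerExitStatus.SUCCESS. A minimal sketch of the explicit initialization (the class is hypothetical, and KILL_EXCEEDED_MEMORY comes from this JIRA's title rather than a released constant):
{code}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

class ContainerStatusSketch {
  // Java initializes an int to 0, which equals ContainerExitStatus.SUCCESS,
  // so an "unset" status must be assigned ContainerExitStatus.INVALID explicitly.
  private int exitStatus = ContainerExitStatus.INVALID;

  boolean isStatusKnown() {
    return exitStatus != ContainerExitStatus.INVALID;
  }

  void setExitStatus(int status) {
    // e.g. the KILL_EXCEEDED_MEMORY value this JIRA proposes, once it exists
    this.exitStatus = status;
  }
}
{code}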
[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
[ https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016116#comment-14016116 ] Hudson commented on YARN-1913: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5646 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5646/]) YARN-1913. With Fair Scheduler, cluster can logjam when all resources are consumed by AMs (Wei Yan via Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599400) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationFileLoaderService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestAllocationFileLoaderService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm With Fair Scheduler, cluster can logjam when all resources are consumed by AMs -- Key: YARN-1913 URL: https://issues.apache.org/jira/browse/YARN-1913 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Wei Yan Fix For: 2.5.0 Attachments: YARN-1913.patch, YARN-1913.patch, 
YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, YARN-1913.patch It's possible to deadlock a cluster by submitting many applications at once, and have all cluster resources taken up by AMs. One solution is for the scheduler to limit resources taken up by AMs, as a percentage of total cluster resources, via a maxApplicationMasterShare config. -- This message was sent by Atlassian JIRA (v6.2#6252)
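To illustrate the approach, here is a hedged sketch of the AM-share cap: before launching an AM, the queue checks that its current AM resource usage plus the new AM stays within maxAMShare of its fair share. The method name and signature are illustrative, not the exact FSLeafQueue code; in this sketch, a negative maxAMShare disables the cap.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

class AmShareSketch {
  // Returns true if a new AM of size amResource can launch without pushing
  // the queue's total AM usage past maxAMShare of its fair share.
  static boolean canRunAppAM(Resource amResource, Resource amUsage,
      Resource fairShare, float maxAMShare) {
    if (maxAMShare < 0) {
      return true;  // a negative value disables the cap in this sketch
    }
    long memLimit = (long) (fairShare.getMemory() * maxAMShare);
    return amUsage.getMemory() + amResource.getMemory() <= memLimit;
  }
}
{code}
Capping AM usage this way leaves headroom for task containers, so a burst of submitted applications can no longer occupy the whole cluster with AMs that have nowhere to run their tasks.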
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016151#comment-14016151 ] Carlo Curino commented on YARN-2022: Hi Sunil, I read the doc with [~chris.douglas] and [~subru], and we agree with the general direction, though you will have to be very careful to test this thoroughly, as you are enforcing rather tricky invariants. A couple of specific concerns: 1) The yarn.resourcemanager.monitor.capacity.preemption.am_container_limit you propose is, I think, a bit of overkill. I understand the intent to allow for more tunable preemption of AMs, but I worry this is such an esoteric parameter that people will not know how to use it. I personally would have to think very hard to figure out exactly what different configurations of it would give me in terms of increasing/decreasing the chances of an AM surviving preemption, and in terms of improving overall cluster efficiency. I propose to enforce only the existing invariants (am-percentage, max-apps, etc.), as the semantics are crisper: the preemption policy will re-establish the invariants of the queue, no more, no less. 2) Preserving the correct user mix of jobs in the queue is also a good addition, though again I am worried this is tricky code to write, so I strongly encourage you to write many, many unit tests, and to test the policy on a cluster extensively before it gets committed. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Attachments: YARN-2022-DesignDraft.docx, Yarn-2022.1.patch Cluster size = 16GB [2 NMs]. Queue A capacity = 50%, Queue B capacity = 50%. Consider 3 applications running in Queue A, which has taken the full cluster capacity: J1 = 2GB AM + 1GB * 4 maps; J2 = 2GB AM + 1GB * 4 maps; J3 = 2GB AM + 1GB * 2 maps. Another job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 maps]. Currently in this scenario, job J3 will get killed, including its AM. It would be better if the AM could be given the least priority among multiple applications: in this same scenario, map tasks from J3 and J2 could be preempted instead. Later, when the cluster is free, maps can be reallocated to these jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
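A sketch of the ordering idea under discussion: preemption candidates sorted so AM containers are considered only after every non-AM container, with newer containers going first within each class. The types are hypothetical, not the actual ProportionalCapacityPreemptionPolicy code.
{code}
import java.util.Comparator;
import java.util.List;

class PreemptionOrderSketch {
  static class Candidate {
    final boolean isAm;
    final long containerId;  // larger id = more recently allocated

    Candidate(boolean isAm, long containerId) {
      this.isAm = isAm;
      this.containerId = containerId;
    }
  }

  // Sort so every non-AM container precedes any AM container, and within
  // each class the most recently allocated containers are preempted first.
  static void orderForPreemption(List<Candidate> candidates) {
    candidates.sort(
        Comparator.comparing((Candidate c) -> c.isAm)  // non-AM (false) first
            .thenComparing(
                Comparator.comparingLong((Candidate c) -> c.containerId)
                    .reversed()));
  }
}
{code}
In the example above, this ordering would preempt maps from J3 and J2 before any AM container is touched.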
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016155#comment-14016155 ] Sandy Ryza commented on YARN-2026: -- Hi Ashwin, I have been busy with other stuff and probably will be for the next week or two. I see your point. I need to think about it a little more - the main aim of preemption is to enforce guarantees for purposes like maintaining SLAs. While converging towards fairness more quickly in user queues could be a nice property, it satisfies a slightly different goal. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016191#comment-14016191 ] Ashwin Shankar commented on YARN-2026: -- [~sandyr], sure Sandy, I'll patiently wait for your response. Also, if you prefer, please feel free to point me to some other committer who knows the FS code base well. We are very interested in getting this JIRA and YARN-1961 committed this month, since it's affecting our query cluster. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt -- This message was sent by Atlassian JIRA (v6.2#6252)