[jira] [Created] (YARN-1447) Define common PB types for container resource change
Wangda Tan created YARN-1447: Summary: Define common PB types for container resource change Key: YARN-1447 URL: https://issues.apache.org/jira/browse/YARN-1447 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.2.0 Reporter: Wangda Tan Assignee: Wangda Tan As described in YARN-1197, we need to add some common PB types for container resource changes, such as ResourceChangeContext. These types will be used by both the RM and NM protocols. -- This message was sent by Atlassian JIRA (v6.1#6144)
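For illustration, a minimal sketch of what such a common record type could look like on the Java side; the name ResourceChangeContext comes from the description above, but the fields and shape below are assumptions, not the committed API:
{code}
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.Resource;

// Hypothetical sketch of a common container-resource-change record; the
// committed YARN-1447 type may differ in name, fields, and factory methods.
public abstract class ResourceChangeContext {
  // The container whose allocation is being changed.
  public abstract ContainerId getContainerId();
  public abstract void setContainerId(ContainerId containerId);

  // The target capability (memory/vcores) the container should be resized to.
  public abstract Resource getTargetCapability();
  public abstract void setTargetCapability(Resource target);
}
{code}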
[jira] [Created] (YARN-1449) Protocol changes on the NM side to support changing container resources
Wangda Tan created YARN-1449: Summary: Protocol changes on the NM side to support changing container resources Key: YARN-1449 URL: https://issues.apache.org/jira/browse/YARN-1449 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Wangda Tan Assignee: Wangda Tan As described in YARN-1197, we need to add APIs in the NM to: 1) add a changeContainersResources method to ContainerManagementProtocol, 2) return the succeeded/failed increased/decreased containers in the changeContainersResources response, and 3) add a new decreased-containers field to NodeStatus so the NM can notify the RM of such changes. -- This message was sent by Atlassian JIRA (v6.1#6144)
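A rough sketch of the protocol addition being described; the request/response shapes below are assumptions for illustration, not the committed signatures:
{code}
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.Resource;

// Hypothetical stubs; the real YARN-1449 record types and method
// signature may differ.
class ChangeContainersResourcesRequest {
  Map<ContainerId, Resource> targetResources; // per-container new limits
}

class ChangeContainersResourcesResponse {
  List<ContainerId> succeededChangedContainers;
  List<ContainerId> failedChangedContainers;
}

interface ContainerManagementProtocolSketch {
  // Existing methods (startContainers, stopContainers, ...) elided.
  ChangeContainersResourcesResponse changeContainersResources(
      ChangeContainersResourcesRequest request);
}
{code}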
[jira] [Created] (YARN-1502) Protocol changes and implementations on the RM side to support changing container resources
Wangda Tan created YARN-1502: Summary: Protocol changes and implementations on the RM side to support changing container resources Key: YARN-1502 URL: https://issues.apache.org/jira/browse/YARN-1502 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan As described in YARN-1197, we need the following API/implementation changes: 1) add a List<ContainerResourceIncreaseRequest> to the YarnScheduler interface, 2) return resource-changed containers in AllocateResponse, and 3) add an implementation on the CapacityScheduler side to support increase/decrease. For other details, please refer to the design doc and discussion in YARN-1197. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (YARN-1509) Make AMRMClient support sending container increase requests and getting increased/decreased containers
Wangda Tan created YARN-1509: Summary: Make AMRMClient support sending container increase requests and getting increased/decreased containers Key: YARN-1509 URL: https://issues.apache.org/jira/browse/YARN-1509 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan As described in YARN-1197, we need to add APIs in AMRMClient to: 1) add a container increase request, and 2) get the successfully increased/decreased containers from the RM. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
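A hypothetical usage sketch from the AM side; requestContainerResourceIncrease and the response getters named in the comment are assumed names for illustration, not the final AMRMClient API:
{code}
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Resource;

// Assumed client-side surface, for illustration only.
interface AMRMClientResizeSketch {
  void requestContainerResourceIncrease(Container container, Resource target);
}

class ResizeExample {
  void growContainer(AMRMClientResizeSketch client, Container container) {
    // Ask the RM to double the container's memory, keeping vcores unchanged.
    Resource target = Resource.newInstance(
        container.getResource().getMemory() * 2,
        container.getResource().getVirtualCores());
    client.requestContainerResourceIncrease(container, target);
    // A later allocate() heartbeat would carry back the containers whose
    // change the RM granted, e.g. via getIncreasedContainers() /
    // getDecreasedContainers() on the allocate response.
  }
}
{code}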
[jira] [Created] (YARN-1510) Make NMClient support changing container resources
Wangda Tan created YARN-1510: Summary: Make NMClient support changing container resources Key: YARN-1510 URL: https://issues.apache.org/jira/browse/YARN-1510 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: Wangda Tan As described in YARN-1197 and YARN-1449, we need to add APIs in NMClient to support: 1) sending requests to increase/decrease container resource limits, and 2) getting the succeeded/failed changed containers in the response from the NM. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (YARN-1609) Add Service Container type to NodeManager in YARN
Wangda Tan created YARN-1609: Summary: Add Service Container type to NodeManager in YARN Key: YARN-1609 URL: https://issues.apache.org/jira/browse/YARN-1609 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Wangda Tan Assignee: Wangda Tan From our work to support running OpenMPI on YARN (MAPREDUCE-2911), we found that it’s important to have a framework-specific daemon process manage the tasks on each node directly. The daemon process, likely similar in other frameworks as well, provides critical services to tasks running on that node (for example, “wireup”, or spawning user processes in large numbers at once). In YARN, it’s hard, if not impossible, to have those processes managed by YARN. We propose to extend the container model on the NodeManager side with a “Service Container” type to run/manage such framework daemon/service processes. We believe this will be very useful to other application framework developers as well. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1644) [YARN-1197] Add newly decreased containers to NodeStatus on the NM side
Wangda Tan created YARN-1644: Summary: [YARN-1197] Add newly decreased containers to NodeStatus on the NM side Key: YARN-1644 URL: https://issues.apache.org/jira/browse/YARN-1644 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1643) [YARN-1197] Make ContainersMonitor support changing the monitored size of an allocated container on the NM side
Wangda Tan created YARN-1643: Summary: [YARN-1197] Make ContainersMonitor support changing the monitored size of an allocated container on the NM side Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1648) [YARN-1197] Modify ApplicationMasterService to support changing container resource
Wangda Tan created YARN-1648: Summary: [YARN-1197] Modify ApplicationMasterService to support changing container resource Key: YARN-1648 URL: https://issues.apache.org/jira/browse/YARN-1648 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1651) [YARN-1197] Add methods in FiCaSchedulerApp to support add/reserve/unreserve/allocate/pull change container requests/results
Wangda Tan created YARN-1651: Summary: [YARN-1197] Add methods in FiCaSchedulerApp to support add/reserve/unreserve/allocate/pull change container requests/results Key: YARN-1651 URL: https://issues.apache.org/jira/browse/YARN-1651 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1650) [YARN-1197] Add pullDecreasedContainer API to RMNode which can be used by scheduler to get newly decreased Containers
Wangda Tan created YARN-1650: Summary: [YARN-1197] Add pullDecreasedContainer API to RMNode which can be used by scheduler to get newly decreased Containers Key: YARN-1650 URL: https://issues.apache.org/jira/browse/YARN-1650 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1646) [YARN-1197] Add increase container request to YarnScheduler allocate API
Wangda Tan created YARN-1646: Summary: [YARN-1197] Add increase container request to YarnScheduler allocate API Key: YARN-1646 URL: https://issues.apache.org/jira/browse/YARN-1646 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1649) [YARN-1197] Modify ResourceTrackerService to support passing decreased containers to RMNode
Wangda Tan created YARN-1649: Summary: [YARN-1197] Modify ResourceTrackerService to support passing decreased containers to RMNode Key: YARN-1649 URL: https://issues.apache.org/jira/browse/YARN-1649 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1647) [YARN-1197] Add increased/decreased container to Allocation
Wangda Tan created YARN-1647: Summary: [YARN-1197] Add increased/decreased container to Allocation Key: YARN-1647 URL: https://issues.apache.org/jira/browse/YARN-1647 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1654) [YARN-1197] Add implementations to CapacityScheduler to support increase/decrease container resource
Wangda Tan created YARN-1654: Summary: [YARN-1197] Add implementations to CapacityScheduler to support increase/decrease container resource Key: YARN-1654 URL: https://issues.apache.org/jira/browse/YARN-1654 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1652) [YARN-1197] Add methods in FiCaSchedulerNode to support increase/decrease/reserve/unreserve change container requests/results
Wangda Tan created YARN-1652: Summary: [YARN-1197] Add methods in FiCaSchedulerNode to support increase/decrease/reserve/unreserve change container requests/results Key: YARN-1652 URL: https://issues.apache.org/jira/browse/YARN-1652 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1655) [YARN-1197] Add implementations to FairScheduler to support increase/decrease container resource
Wangda Tan created YARN-1655: Summary: [YARN-1197] Add implementations to FairScheduler to support increase/decrease container resource Key: YARN-1655 URL: https://issues.apache.org/jira/browse/YARN-1655 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1653) [YARN-1197] Add APIs in CSQueue to support decrease container resource and unreserve increase request
Wangda Tan created YARN-1653: Summary: [YARN-1197] Add APIs in CSQueue to support decrease container resource and unreserve increase request Key: YARN-1653 URL: https://issues.apache.org/jira/browse/YARN-1653 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (YARN-1871) We should eliminate writing *PBImpl code in YARN
Wangda Tan created YARN-1871: Summary: We should eliminate writing *PBImpl code in YARN Key: YARN-1871 URL: https://issues.apache.org/jira/browse/YARN-1871 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Currently, we need to write PBImpl classes one by one. After running {{find . -name "*PBImpl*.java" | xargs wc -l}} under the Hadoop source directory, we can see there are more than 25,000 LOC. I think we should improve this, which will be very helpful for YARN developers making changes to YARN protocols. There are only a few limited patterns in the current *PBImpl classes: * simple types, like string, int32, float * List<...> types * Map<...> types * enum types Code generation should be enough to produce such PBImpl classes. Some other requirements are: * Leave other related code alone, like service implementations (e.g. ContainerManagerImpl). * (If possible) Forward compatibility: developers can write their own PBImpl or generate them. -- This message was sent by Atlassian JIRA (v6.2#6252)
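To make the repetition concrete, here is the typical shape of the hand-written boilerplate, simplified from real PBImpl classes; every field of every record repeats this proto/builder dance, which is what makes generation attractive:
{code}
import org.apache.hadoop.yarn.proto.YarnProtos.ApplicationIdProto;
import org.apache.hadoop.yarn.proto.YarnProtos.ApplicationIdProtoOrBuilder;

// Simplified illustration of the hand-written pattern; real PBImpls add
// mergeLocalToProto(), getProto(), and per-field caches on top of this.
public class PBImplBoilerplateSketch {
  private ApplicationIdProto proto = ApplicationIdProto.getDefaultInstance();
  private ApplicationIdProto.Builder builder = null;
  private boolean viaProto = false;

  public int getId() {
    // Read from whichever form currently holds the data.
    ApplicationIdProtoOrBuilder p = viaProto ? proto : builder;
    return p.getId();
  }

  public void setId(int id) {
    maybeInitBuilder();
    builder.setId(id);
  }

  private void maybeInitBuilder() {
    // Switch to the mutable builder form before any write.
    if (viaProto || builder == null) {
      builder = ApplicationIdProto.newBuilder(proto);
    }
    viaProto = false;
  }
}
{code}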
[jira] [Created] (YARN-1917) Add waitForCompletion interface to YarnClient
Wangda Tan created YARN-1917: Summary: Add waitForCompletion interface to YarnClient Key: YARN-1917 URL: https://issues.apache.org/jira/browse/YARN-1917 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 2.4.0 Reporter: Wangda Tan Currently, YARN doesn't have this method. Users need to write implementations like UnmanagedAMLauncher.monitorApplication or mapreduce.Job.monitorAndPrintJob on their own. This feature should be helpful to end users. -- This message was sent by Atlassian JIRA (v6.2#6252)
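A minimal polling sketch of what such a helper could do, built only on the existing {{YarnClient#getApplicationReport}} API; this illustrates the idea, it is not the proposed implementation:
{code}
import java.util.EnumSet;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public final class WaitForCompletionSketch {
  private static final EnumSet<YarnApplicationState> TERMINAL = EnumSet.of(
      YarnApplicationState.FINISHED,
      YarnApplicationState.FAILED,
      YarnApplicationState.KILLED);

  // Block until the application reaches a terminal state, then return the
  // final report (callers can inspect its FinalApplicationStatus).
  public static ApplicationReport waitForCompletion(
      YarnClient client, ApplicationId appId) throws Exception {
    while (true) {
      ApplicationReport report = client.getApplicationReport(appId);
      if (TERMINAL.contains(report.getYarnApplicationState())) {
        return report;
      }
      Thread.sleep(1000); // polling interval; tune as needed
    }
  }
}
{code}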
[jira] [Created] (YARN-1927) Preemption message shouldn’t be created multiple times for the same container-id in ProportionalCapacityPreemptionPolicy
Wangda Tan created YARN-1927: Summary: Preemption message shouldn’t be created multiple times for the same container-id in ProportionalCapacityPreemptionPolicy Key: YARN-1927 URL: https://issues.apache.org/jira/browse/YARN-1927 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Wangda Tan Priority: Minor Currently, after each editSchedule() call, a preemption message is created and sent to the scheduler. ProportionalCapacityPreemptionPolicy should only send a preemption message once for each container. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2104) Scheduler queue filter failed to work because index of queue column changed
Wangda Tan created YARN-2104: Summary: Scheduler queue filter failed to work because index of queue column changed Key: YARN-2104 URL: https://issues.apache.org/jira/browse/YARN-2104 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan YARN-563 added
{code}
+ th(".type", "Application Type").
{code}
to the application table, which moved the queue column’s index from 3 to 4. But on the scheduler page, the queue column index is hard-coded to 3 when filtering applications by queue name:
{code}
if (q == 'root') q = '';
else q = '^' + q.substr(q.lastIndexOf('.') + 1) + '$';
$('#apps').dataTable().fnFilter(q, 3, true);
{code}
So the queue filter does not work on the applications page. Reproduce steps (thanks to Bo Yang for pointing this out):
{code}
1) In the default setup, there’s a "default" queue under the root queue
2) Run an arbitrary application; you can find it on the "Applications" page
3) Click the "default" queue on the scheduler page
4) Click "Applications": no application shows up
5) Click the "root" queue on the scheduler page
6) Click "Applications": the application shows up again
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2124) ProportionalCapacityPreemptionPolicy cannot work because it's initialized before the scheduler
Wangda Tan created YARN-2124: Summary: ProportionalCapacityPreemptionPolicy cannot work because it's initialized before the scheduler Key: YARN-2124 URL: https://issues.apache.org/jira/browse/YARN-2124 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 3.0.0 Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical While playing with scheduler preemption, I found that ProportionalCapacityPreemptionPolicy cannot work: an NPE is raised when the RM starts.
{code}
2014-06-05 11:01:33,201 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception.
java.lang.NullPointerException
	at org.apache.hadoop.yarn.util.resource.Resources.greaterThan(Resources.java:225)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.computeIdealResourceDistribution(ProportionalCapacityPreemptionPolicy.java:302)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.recursivelyComputeIdealAssignment(ProportionalCapacityPreemptionPolicy.java:261)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:198)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:174)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82)
	at java.lang.Thread.run(Thread.java:744)
{code}
This is because ProportionalCapacityPreemptionPolicy needs the ResourceCalculator from CapacityScheduler, but the policy is initialized before the CapacityScheduler is, so the ResourceCalculator is still null inside ProportionalCapacityPreemptionPolicy. -- This message was sent by Atlassian JIRA (v6.2#6252)
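A compact sketch of the hazard and one possible fix direction (resolve the calculator lazily, at invocation time, rather than capturing it during construction); names are simplified and the actual fix may differ:
{code}
// Simplified illustration only.
interface SchedulerSketch {
  Object getResourceCalculator(); // null until the scheduler is initialized
}

class PreemptionPolicySketch {
  private final SchedulerSketch scheduler;

  PreemptionPolicySketch(SchedulerSketch scheduler) {
    this.scheduler = scheduler;
    // Hazard: scheduler.getResourceCalculator() may still be null here,
    // because the policy is constructed before the scheduler initializes.
  }

  void editSchedule() {
    // Safer: look the calculator up on each invocation, after the
    // scheduler has finished its own initialization.
    Object calculator = scheduler.getResourceCalculator();
    if (calculator == null) {
      return; // scheduler not ready yet; skip this round
    }
    // ... compute the ideal resource distribution using the calculator ...
  }
}
{code}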
[jira] [Created] (YARN-2125) ProportionalCapacityPreemptionPolicy should only log CSV when debug enabled
Wangda Tan created YARN-2125: Summary: ProportionalCapacityPreemptionPolicy should only log CSV when debug enabled Key: YARN-2125 URL: https://issues.apache.org/jira/browse/YARN-2125 Project: Hadoop YARN Issue Type: Task Components: resourcemanager, scheduler Affects Versions: 3.0.0 Reporter: Wangda Tan Assignee: Wangda Tan Priority: Minor Attachments: YARN-2125.patch Currently, the output of logToCSV() is written via LOG.info() in ProportionalCapacityPreemptionPolicy, which generates non-human-readable text in the ResourceManager's log every few seconds, like:
{code}
...
2014-06-05 15:57:07,603 INFO org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy: QUEUESTATE: 1401955027603, a1, 4096, 3, 2048, 2, 4096, 3, 4096, 3, 0, 0, 0, 0, b1, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0, b2, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0
2014-06-05 15:57:10,603 INFO org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy: QUEUESTATE: 1401955030603, a1, 4096, 3, 2048, 2, 4096, 3, 4096, 3, 0, 0, 0, 0, b1, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0, b2, 3072, 2, 1024, 1, 3072, 2, 3072, 2, 0, 0, 0, 0
...
{code}
It's better to emit this output only when debug logging is enabled. -- This message was sent by Atlassian JIRA (v6.2#6252)
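The requested change is the standard commons-logging guard: build and emit the CSV only when debug is enabled. A sketch, with logQueueState standing in for the real call site:
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class QueueStateLoggingSketch {
  private static final Log LOG = LogFactory.getLog(QueueStateLoggingSketch.class);

  void logQueueState(String csv) {
    // Skip both the string construction and the log write unless debug
    // logging is turned on for this class.
    if (LOG.isDebugEnabled()) {
      LOG.debug("QUEUESTATE: " + csv);
    }
  }
}
{code}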
[jira] [Created] (YARN-2143) Merge common killContainer logic of Fair/Capacity scheduler into AbstractYarnScheduler
Wangda Tan created YARN-2143: Summary: Merge common killContainer logic of Fair/Capacity scheduler into AbstractYarnScheduler Key: YARN-2143 URL: https://issues.apache.org/jira/browse/YARN-2143 Project: Hadoop YARN Issue Type: Task Components: resourcemanager, scheduler Reporter: Wangda Tan Currently, CapacityScheduler has a killContainer API inherited from PreemptableResourceScheduler, and FairScheduler uses warnOrKillContainer to do container preemption. We should merge the common container-kill code into AbstractYarnScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2148) TestNMClient fails due to more exit code values being added and passed to the AM
Wangda Tan created YARN-2148: Summary: TestNMClient fails due to more exit code values being added and passed to the AM Key: YARN-2148 URL: https://issues.apache.org/jira/browse/YARN-2148 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 3.0.0 Reporter: Wangda Tan Currently, TestNMClient fails in trunk; see https://builds.apache.org/job/PreCommit-YARN-Build/3959/testReport/junit/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClient/
{code}
java.lang.AssertionError: null
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertTrue(Assert.java:52)
	at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testGetContainerStatus(TestNMClient.java:385)
	at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testContainerManagement(TestNMClient.java:347)
	at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClient(TestNMClient.java:226)
{code}
Test cases in TestNMClient use the following code to verify the exit code of COMPLETED containers:
{code}
testGetContainerStatus(container, i, ContainerState.COMPLETE,
    "Container killed by the ApplicationMaster.",
    Arrays.asList(new Integer[] {137, 143, 0}));
{code}
But YARN-2091 added logic to make the exit code reflect the actual status, so the exit code of a container killed by the ApplicationMaster will be -105:
{code}
if (container.hasDefaultExitCode()) {
  container.exitCode = exitEvent.getExitCode();
}
{code}
We should update the test case as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
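The test-side fix sketched from the description above: accept the new -105 exit code alongside the legacy values, assuming the assertion keeps its current shape:
{code}
testGetContainerStatus(container, i, ContainerState.COMPLETE,
    "Container killed by the ApplicationMaster.",
    Arrays.asList(new Integer[] {137, 143, 0, -105}));
{code}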
[jira] [Created] (YARN-2149) Test failed in TestRMAdminCLI
Wangda Tan created YARN-2149: Summary: Test failed in TestRMAdminCLI Key: YARN-2149 URL: https://issues.apache.org/jira/browse/YARN-2149 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan I noticed there are two test failures in TestRMAdminCLI: 1) testHelp https://builds.apache.org/job/PreCommit-YARN-Build/3959//testReport/org.apache.hadoop.yarn.client/TestRMAdminCLI/testHelp/
{code}
java.lang.AssertionError: null
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertTrue(Assert.java:52)
	at org.apache.hadoop.yarn.client.TestRMAdminCLI.testError(TestRMAdminCLI.java:366)
	at org.apache.hadoop.yarn.client.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:307)
{code}
This should be caused by the --forceactive option recently added to transitionToActive. 2) testTransitionToActive https://builds.apache.org/job/PreCommit-YARN-Build/3959/testReport/junit/org.apache.hadoop.yarn.client/TestRMAdminCLI/testTransitionToActive/
{code}
java.lang.UnsupportedOperationException: null
	at java.util.AbstractList.remove(AbstractList.java:144)
	at java.util.AbstractList$Itr.remove(AbstractList.java:360)
	at java.util.AbstractCollection.remove(AbstractCollection.java:252)
	at org.apache.hadoop.ha.HAAdmin.isOtherTargetNodeActive(HAAdmin.java:173)
	at org.apache.hadoop.ha.HAAdmin.transitionToActive(HAAdmin.java:144)
	at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:447)
	at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:380)
	at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:318)
	at org.apache.hadoop.yarn.client.TestRMAdminCLI.testTransitionToActive(TestRMAdminCLI.java:180)
{code}
This is caused by the underlying list not supporting element removal: AbstractList.remove() throws UnsupportedOperationException (e.g. for the fixed-size list returned by Arrays.asList). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2149) Test failed in TestRMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-2149. -- Resolution: Duplicate Assignee: Wangda Tan Test failed in TestRMAdminCLI - Key: YARN-2149 URL: https://issues.apache.org/jira/browse/YARN-2149 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2181) Add preemption info to RM Web UI
Wangda Tan created YARN-2181: Summary: Add preemption info to RM Web UI Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan We need to add preemption info to the RM web pages so that administrators/users can better understand the preemption that has happened on an app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2191) Add a test to make sure NM will do application cleanup even if RM restart happens before application completes
Wangda Tan created YARN-2191: Summary: Add a test to make sure NM will do application cleanup even if RM restart happens before application completes Key: YARN-2191 URL: https://issues.apache.org/jira/browse/YARN-2191 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan In YARN-1885, there's a test in TestApplicationCleanup#testAppCleanupWhenRestartedAfterAppFinished. But this is not enough; we need one more test to make sure the NM will do app cleanup when the restart happens before the app finishes. The sequence is: 1. Submit app1 to RM1. 2. NM1 launches app1's AM (container-0); NM2 launches app1's task containers. 3. Restart RM1. 4. Before RM1 finishes restarting, container-0 completes on NM1. 5. RM1 finishes restarting; NM1 reports container-0 completed; app1 completes. 6. RM1 should be able to notify NM1/NM2 to clean up app1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2271) Add application attempt metrics to RM Web UI/service when AppAttempt page available
Wangda Tan created YARN-2271: Summary: Add application attempt metrics to RM Web UI/service when AppAttempt page available Key: YARN-2271 URL: https://issues.apache.org/jira/browse/YARN-2271 Project: Hadoop YARN Issue Type: Task Reporter: Wangda Tan Currently, we only show application metrics in the RM Web UI on the application page (YARN-2181). An application attempt page is planned for the RM Web UI. Once it's available, we should add attempt metrics to that page and to the web service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches
[ https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-2258. -- Resolution: Duplicate Assignee: Wangda Tan Aggregation of MR job logs failing when Resourcemanager switches Key: YARN-2258 URL: https://issues.apache.org/jira/browse/YARN-2258 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Wangda Tan 1. Install RM in HA mode. 2. Run a job with many tasks. 3. Induce an RM switchover while the job is in progress. Observe that log aggregation fails for the job that is running when the ResourceManager switchover is induced. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2308) NPE when RM restarts after CapacityScheduler queue configuration changed
Wangda Tan created YARN-2308: Summary: NPE when RM restarts after CapacityScheduler queue configuration changed Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan I encountered an NPE when the RM restarted:
{code}
2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:744)
{code}
And the RM fails to restart. This is caused by a queue configuration change: I removed some queues and added new ones. When the RM restarts, it tries to recover historical applications, and when the queue of any of those applications has been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
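A defensive-handling sketch of the direction a fix could take: check for a missing queue during recovery instead of NPE-ing and aborting RM startup. Names are simplified and the actual fix may differ:
{code}
import java.util.HashMap;
import java.util.Map;

class RecoverySketch {
  interface Queue {}
  private final Map<String, Queue> queues = new HashMap<String, Queue>();

  void addApplicationAttempt(String appId, String queueName) {
    Queue queue = queues.get(queueName); // null if the queue was removed
    if (queue == null) {
      // Reject/fail the recovered application instead of throwing an NPE
      // that kills RM startup.
      System.err.println("Queue " + queueName + " no longer exists; "
          + "rejecting recovered application " + appId);
      return;
    }
    // ... normal attempt registration ...
  }
}
{code}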
[jira] [Created] (YARN-2492) [Umbrella] Allow for (admin) labels on nodes and resource-requests
Wangda Tan created YARN-2492: Summary: [Umbrella] Allow for (admin) labels on nodes and resource-requests Key: YARN-2492 URL: https://issues.apache.org/jira/browse/YARN-2492 Project: Hadoop YARN Issue Type: Task Components: api, client, resourcemanager Reporter: Wangda Tan Since YARN-796 is a sub-JIRA of YARN-397, this JIRA is used to create and track subtasks and attach split patches for YARN-796. Let's keep all overall discussion on YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2493) [YARN-796] API changes for users
Wangda Tan created YARN-2493: Summary: [YARN-796] API changes for users Key: YARN-2493 URL: https://issues.apache.org/jira/browse/YARN-2493 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Wangda Tan Assignee: Wangda Tan This JIRA includes API changes for users of YARN-796, like changes in {{ResourceRequest}}, {{ApplicationSubmissionContext}}, etc. This is a common part of YARN-796. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2494) [YARN-796] Node label manager API and storage implementations
Wangda Tan created YARN-2494: Summary: [YARN-796] Node label manager API and storage implementations Key: YARN-2494 URL: https://issues.apache.org/jira/browse/YARN-2494 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan This JIRA includes the APIs and storage implementations of the node label manager. NodeLabelManager is an abstract class used to manage labels of nodes in the cluster; it has APIs to query/modify: - nodes with a given label - labels of a given hostname - add/remove labels - set labels of nodes in the cluster - persist/recover changes of labels/labels-on-nodes to/from storage. It has two implementations to store modifications: - memory-based storage: it will not persist changes, so all labels will be lost when the RM restarts - FileSystem-based storage: it will persist/recover to/from a FileSystem (like HDFS), and all labels and labels-on-nodes will be recovered upon RM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
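An illustrative sketch of the API surface enumerated above; the committed class and method names may differ:
{code}
import java.util.Map;
import java.util.Set;

abstract class NodeLabelManagerSketch {
  // Query side
  abstract Set<String> getNodesWithLabel(String label);
  abstract Set<String> getLabelsOnNode(String hostname);

  // Modify side
  abstract void addLabels(Set<String> labels);
  abstract void removeLabels(Set<String> labels);
  abstract void setLabelsOnNodes(Map<String, Set<String>> nodeToLabels);

  // Persistence: the memory-based implementation makes these no-ops (labels
  // are lost on RM restart); the FileSystem-based one writes to/reads from
  // HDFS-like storage so labels survive restart.
  abstract void persist();
  abstract void recover();
}
{code}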
[jira] [Created] (YARN-2497) [YARN-796] Changes in the fair scheduler to support allocating resources respecting labels
Wangda Tan created YARN-2497: Summary: [YARN-796] Changes in the fair scheduler to support allocating resources respecting labels Key: YARN-2497 URL: https://issues.apache.org/jira/browse/YARN-2497 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
Wangda Tan created YARN-2500: Summary: [YARN-796] Miscellaneous changes in ResourceManager to support labels Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2499) [YARN-796] Respect labels in preemption policy of fair scheduler
Wangda Tan created YARN-2499: Summary: [YARN-796] Respect labels in preemption policy of fair scheduler Key: YARN-2499 URL: https://issues.apache.org/jira/browse/YARN-2499 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2503) [YARN-796] Changes in RM Web UI to better show labels to end users
Wangda Tan created YARN-2503: Summary: [YARN-796] Changes in RM Web UI to better show labels to end users Key: YARN-2503 URL: https://issues.apache.org/jira/browse/YARN-2503 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Including but not limited to: - show labels of nodes on the RM nodes page - show labels of queues on the RM scheduler page - warn the user/admin if the capacity of a queue cannot be guaranteed due to misconfigured labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2504) [YARN-796] Support get/add/remove/change labels in RM admin CLI
Wangda Tan created YARN-2504: Summary: [YARN-796] Support get/add/remove/change labels in RM admin CLI Key: YARN-2504 URL: https://issues.apache.org/jira/browse/YARN-2504 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2505) [YARN-796] Support get/add/remove/change labels in RM REST API
Wangda Tan created YARN-2505: Summary: [YARN-796] Support get/add/remove/change labels in RM REST API Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2544) [YARN-796] Common server-side PB changes (not including user API PB changes)
Wangda Tan created YARN-2544: Summary: [YARN-796] Common server-side PB changes (not including user API PB changes) Key: YARN-2544 URL: https://issues.apache.org/jira/browse/YARN-2544 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2637) maximum-am-resource-percent will be violated when resource of AM is > minimumAllocation
Wangda Tan created YARN-2637: Summary: maximum-am-resource-percent will be violated when resource of AM is > minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Priority: Critical Currently, the number of AMs in a leaf queue is calculated in the following way:
{code}
max_am_resource = queue_max_capacity * maximum_am_resource_percent
#max_am_number = max_am_resource / minimum_allocation
#max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor
{code}
And when a new application is submitted to the RM, it checks whether the app can be activated in the following way:
{code}
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  // Check queue limit
  if (getNumActiveApplications() >= getMaximumActiveApplications()) {
    break;
  }
  // Check user limit
  User user = getUser(application.getUser());
  if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) {
    user.activateApplication();
    activeApplications.add(application);
    i.remove();
    LOG.info("Application " + application.getApplicationId() +
        " from user: " + application.getUser() +
        " activated in queue: " + getQueueName());
  }
}
{code}
An example: if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200. If the user requests 5M for each AM (> minimum_allocation), all 200 apps can still be activated, occupying 200 * 5M = 1000M, the queue's entire capacity, instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2647) [YARN-796] Add yarn queue CLI to get queue info including labels of such queue
Wangda Tan created YARN-2647: Summary: [YARN-796] Add yarn queue CLI to get queue info including labels of such queue Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2665) Audit warning of registry project
Wangda Tan created YARN-2665: Summary: Audit warning of registry project Key: YARN-2665 URL: https://issues.apache.org/jira/browse/YARN-2665 Project: Hadoop YARN Issue Type: Bug Components: site Reporter: Wangda Tan Assignee: Steve Loughran Priority: Minor I encountered one audit warning today; see https://issues.apache.org/jira/browse/YARN-2544?focusedCommentId=14164515&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14164515 It seems to be caused by the recently committed registry project.
{code}
!? /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep
Lines that start with ? in the release audit report indicate files that do not have an Apache license header.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2685) Resource per label is incorrect when multiple NMs run on the same host and some have labels and some don't
Wangda Tan created YARN-2685: Summary: Resource per label is incorrect when multiple NMs run on the same host and some have labels and some don't Key: YARN-2685 URL: https://issues.apache.org/jira/browse/YARN-2685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan I noticed an issue: when we have multiple NMs running on the same host (say NM1-4 running on host1), and we specify that some of them have a label and some do not, the total resource per label is not correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
Wangda Tan created YARN-2694: Summary: Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently, node label expression support in the capacity scheduler is only partially completed. A node label expression specified in a ResourceRequest is only respected when specified at the ANY level, and a ResourceRequest with multiple node labels makes user-limit computation tricky. For now we need to temporarily disable these; changes include: - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2695) Support continuously looking at reserved containers with node labels
Wangda Tan created YARN-2695: Summary: Support continuously looking at reserved containers with node labels Key: YARN-2695 URL: https://issues.apache.org/jira/browse/YARN-2695 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan YARN-1769 improved the capacity scheduler to continuously look at reserved containers when trying to reserve/allocate resources. This should be respected when the node/resource-request/queue has a node label. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2698) Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI
Wangda Tan created YARN-2698: Summary: Move getClusterNodeLabels and getNodeToLabels to YARN CLI instead of RMAdminCLI Key: YARN-2698 URL: https://issues.apache.org/jira/browse/YARN-2698 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan YARN RMAdminCLI and AdminService should have write APIs only; the read APIs should be located in the YARN CLI and RMClientService. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2699) Fix test timeout in TestResourceTrackerOnHA#testResourceTrackerOnHA
Wangda Tan created YARN-2699: Summary: Fix test timeout in TestResourceTrackerOnHA#testResourceTrackerOnHA Key: YARN-2699 URL: https://issues.apache.org/jira/browse/YARN-2699 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Wangda Tan Because of changes in YARN-2500/YARN-2496/YARN-2494, registering a node manager with port=0 is no longer allowed. TestResourceTrackerOnHA#testResourceTrackerOnHA fails since it registers a node manager with port=0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2705) Changes of RM node label manager configuration
Wangda Tan created YARN-2705: Summary: Changes of RM node label manager configuration Key: YARN-2705 URL: https://issues.apache.org/jira/browse/YARN-2705 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan 1) Add yarn.node-labels.manager-class; by default it will not store anything to the file system. 2) Use the above in at least some places: RMNodeLabelsManager, RMAdminCLI. Convert {{DummyNodeLabelsManager}} into a {{MemoryNodeLabelsManager}}. 3) Document that the RM configs and client configs for yarn.node-labels.manager-class should match. 4) Rename fs-store.uri to fs-store.root-dir. 5) Similarly for FS_NODE_LABELS_STORE_URI. 6) For the default value of fs-store.uri, put it under /tmp, creating /tmp/hadoop-yarn-${user}/node-labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2710) RM HA tests failed intermittently on trunk
Wangda Tan created YARN-2710: Summary: RM HA tests failed intermittently on trunk Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Failures like the following can happen in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc.:
{code}
org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)  Time elapsed: 9.491 sec  <<< ERROR!
java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
	at org.apache.hadoop.ipc.Client.call(Client.java:1438)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
	at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
	at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2740) RM AdminService should prevent admins from changing labels on nodes when distributed node label configuration is enabled
Wangda Tan created YARN-2740: Summary: RM AdminService should prevent admins from changing labels on nodes when distributed node label configuration is enabled Key: YARN-2740 URL: https://issues.apache.org/jira/browse/YARN-2740 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan According to YARN-2495, labels of nodes will be specified when the NM heartbeats. We shouldn't allow admins to modify labels on nodes when distributed node label configuration is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2778) YARN node CLI should display labels on returned node reports
Wangda Tan created YARN-2778: Summary: YARN node CLI should display labels on returned node reports Key: YARN-2778 URL: https://issues.apache.org/jira/browse/YARN-2778 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2786) Create yarn node-labels CLI to list the node labels collection and node-to-labels mappings
Wangda Tan created YARN-2786: Summary: Create yarn node-labels CLI to list the node labels collection and node-to-labels mappings Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan With YARN-2778, we can list node labels on existing RM nodes. But that is not enough; we should be able to: 1) list the node labels collection, and 2) list node-to-labels mappings even if the node hasn't registered with the RM. The command should start with yarn node-labels ... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2800) Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled
Wangda Tan created YARN-2800: Summary: Should print WARN log in both RM/RMAdminCLI side when MemoryRMNodeLabelsManager is enabled Key: YARN-2800 URL: https://issues.apache.org/jira/browse/YARN-2800 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Even though we have documented this, it would be better to explicitly print a message on both the RM and RMAdminCLI side saying that node labels being added will be lost across RM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2807) Option --forceactive does not work as described in the usage of yarn rmadmin -transitionToActive
Wangda Tan created YARN-2807: Summary: Option --forceactive does not work as described in the usage of yarn rmadmin -transitionToActive Key: YARN-2807 URL: https://issues.apache.org/jira/browse/YARN-2807 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Currently the help message of yarn rmadmin -transitionToActive is:
{code}
transitionToActive: incorrect number of arguments
Usage: HAAdmin [-transitionToActive <serviceId> [--forceactive]]
{code}
But --forceactive does not work as expected. When transitioning the RM state with --forceactive:
{code}
yarn rmadmin -transitionToActive rm2 --forceactive
Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@64c9f31e
Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please specify the forcemanual flag.
{code}
As shown above, we still cannot transition to active with --forceactive when automatic failover is enabled. The option that does work is {{--forcemanual}}, but no place in the usage describes it. I think we should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2824) Capacity of labels should be zero by default
Wangda Tan created YARN-2824: Summary: Capacity of labels should be zero by default Key: YARN-2824 URL: https://issues.apache.org/jira/browse/YARN-2824 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan In the existing Capacity Scheduler behavior, if the user doesn't specify the capacity of a label, queue initialization fails. That causes queue refresh to fail when a new label is added to the node labels collection without modifying capacity-scheduler.xml. With this patch, the capacity of a label must be explicitly set if the user wants to use it. If the user doesn't set the capacity of some labels, we will treat such labels as unused. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2827) Fix bugs of yarn queue CLI
Wangda Tan created YARN-2827: Summary: Fix bugs of yarn queue CLI Key: YARN-2827 URL: https://issues.apache.org/jira/browse/YARN-2827 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Assignee: Wangda Tan Bugs to fix: 1) The args passed to the queue CLI do not include queue: even if you run yarn queue -status .., the args are [-status, ...], so the current assumption is incorrect. 2) It is possible that there's no QueueInfo for the specified queue name; null is returned from YarnClient and an NPE is raised. Add a check for this and print a proper message. 3) When getting QueueInfo fails, return a non-zero exit code. 4) Add tests for the above. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2866) Capacity scheduler preemption policy should respect yarn.scheduler.minimum-allocation-mb when computing resource of queues
Wangda Tan created YARN-2866: Summary: Capacity scheduler preemption policy should respect yarn.scheduler.minimum-allocation-mb when computing resource of queues Key: YARN-2866 URL: https://issues.apache.org/jira/browse/YARN-2866 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Currently, capacity scheduler preemption logic doesn't respect minimum_allocation when computing ideal_assign/guaranteed_resource, etc. We should respect it to avoid some potential rounding issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2869) CapacityScheduler should trim sub-queue names when parsing configuration
Wangda Tan created YARN-2869: Summary: CapacityScheduler should trim sub-queue names when parsing configuration Key: YARN-2869 URL: https://issues.apache.org/jira/browse/YARN-2869 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently, the capacity scheduler doesn't trim sub-queue names when parsing queue names. For example, the configuration
{code}
<configuration>
  <property>
    <name>...root.queues</name>
    <value>a, b , c</value>
  </property>
  <property>
    <name>...root.b.capacity</name>
    <value>100</value>
  </property>
  ...
</configuration>
{code}
will fail with the error:
{code}
java.lang.IllegalArgumentException: Illegal capacity of -1.0 for queue root. a
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration.getCapacity(CapacitySchedulerConfiguration.java:332)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getCapacityFromConf(LeafQueue.java:196)
{code}
It tries to find queues named "a", " b ", and " c", which is apparently wrong; we should trim these sub-queue names. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
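The fix amounts to trimming each token when splitting the comma-separated queues value; a sketch:
{code}
class QueueNameParsingSketch {
  // "a, b , c" and "a,b,c" should both yield queues {a, b, c}.
  static String[] parseQueueNames(String configuredValue) {
    String[] queues = configuredValue.split(",");
    for (int i = 0; i < queues.length; i++) {
      queues[i] = queues[i].trim();
    }
    return queues;
  }
}
{code}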
[jira] [Created] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
Wangda Tan created YARN-2880: Summary: Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have such a test to make sure there will be no regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2895) Integrate distributed scheduling with capacity scheduler
Wangda Tan created YARN-2895: Summary: Integrate distributed scheduling with capacity scheduler Key: YARN-2895 URL: https://issues.apache.org/jira/browse/YARN-2895 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Wangda Tan Assignee: Wangda Tan There are some benefits to integrating the distributed scheduling mechanism (LocalRM) with the capacity scheduler: - resource usage of opportunistic containers can be tracked by the central RM and capacity can be enforced - the opportunity to convert an opportunistic container to a conservative container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2920) CapacityScheduler should be notified when labels on nodes changed
Wangda Tan created YARN-2920: Summary: CapacityScheduler should be notified when labels on nodes changed Key: YARN-2920 URL: https://issues.apache.org/jira/browse/YARN-2920 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Currently, changes to labels on nodes are only handled by RMNodeLabelsManager, but that is not enough: - the scheduler should be able to take action on running containers (like kill/preempt/do nothing) - used/available capacity in the scheduler should be updated for future planning. We need to add a new event to pass such updates to the scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2925) Internal fields in LeafQueue should be protected when accessed from FiCaSchedulerApp to calculate headroom
Wangda Tan created YARN-2925: Summary: Internal fields in LeafQueue should be protected when accessed from FiCaSchedulerApp to calculate headroom Key: YARN-2925 URL: https://issues.apache.org/jira/browse/YARN-2925 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical As of YARN-2644, FiCaScheduler calculates up-to-date headroom before sending the Allocate response back to the AM. The headroom calculation happens on the LeafQueue side and uses fields like used resource, etc., but it is not protected by any LeafQueue lock, so it might read corrupted values if someone else is editing them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
Wangda Tan created YARN-2933: Summary: Capacity Scheduler preemption policy should only consider capacity without labels temporarily Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have a preemption policy that supports it. YARN-2498 targets preemption that respects node labels, but we have some gaps in the code base; for example, queues/FiCaScheduler should be able to get usedResource/pendingResource, etc., by label. These items potentially require refactoring CS, which we need to spend some time thinking about carefully. For now, what we can do immediately is compute ideal_allocation and preempt containers only for resources on nodes without labels, to avoid a regression like: a cluster has some nodes with labels and some without; assume queueA isn't satisfied for resources without labels; the preemption policy may then preempt resources from nodes with labels for queueA, which is not correct. Again, this is just a short-term enhancement; YARN-2498 will consider preemption respecting node labels for the Capacity Scheduler, which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2943) Add a node-labels page in RM web UI
Wangda Tan created YARN-2943: Summary: Add a node-labels page in RM web UI Key: YARN-2943 URL: https://issues.apache.org/jira/browse/YARN-2943 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Now we have node labels in the system, but there is no convenient way to get information like: how many active NMs are assigned to a given label? how much total resource is there for a given label? for a given label, which queues can access it? etc. It would be better to add a node-labels page in the RM web UI so users/admins have a centralized view of such information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3098) Create common QueueCapacities class in Capacity Scheduler to track capacities-by-labels of queues
Wangda Tan created YARN-3098: Summary: Create common QueueCapacities class in Capacity Scheduler to track capacities-by-labels of queues Key: YARN-3098 URL: https://issues.apache.org/jira/browse/YARN-3098 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Similar to YARN-3092, after YARN-796 queues (ParentQueue and LeafQueue) now need to track capacities-by-label (e.g. capacity, maximum-capacity, absolute-capacity, absolute-maximum-capacity, etc.). It's better to have a class that encapsulates these capacities, for better maintainability/readability and fine-grained locking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
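A minimal sketch of what such an encapsulating class could look like, assuming hypothetical names and a per-instance read/write lock; this is not the committed QueueCapacities API:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative capacities-by-label holder: each label maps to its set of
// capacity values, guarded by one read/write lock for fine-grained access.
class QueueCapacitiesSketch {
  enum Type { CAPACITY, MAX_CAPACITY, ABS_CAPACITY, ABS_MAX_CAPACITY }

  private final Map<String, Map<Type, Float>> byLabel = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  float get(String label, Type type) {
    lock.readLock().lock();
    try {
      Map<Type, Float> values = byLabel.get(label);
      return values == null ? 0f : values.getOrDefault(type, 0f);
    } finally {
      lock.readLock().unlock();
    }
  }

  void set(String label, Type type, float value) {
    lock.writeLock().lock();
    try {
      byLabel.computeIfAbsent(label, k -> new HashMap<>()).put(type, value);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}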
[jira] [Created] (YARN-3099) Capacity Scheduler LeafQueue/ParentQueue should use ResourceUsage to track resources-by-label.
Wangda Tan created YARN-3099: Summary: Capacity Scheduler LeafQueue/ParentQueue should use ResourceUsage to track resources-by-label. Key: YARN-3099 URL: https://issues.apache.org/jira/browse/YARN-3099 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3142) Improve locks in AppSchedulingInfo
Wangda Tan created YARN-3142: Summary: Improve locks in AppSchedulingInfo Key: YARN-3142 URL: https://issues.apache.org/jira/browse/YARN-3142 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3139) Improve locks in AbstractYarnScheduler/CapacityScheduler/FairScheduler
Wangda Tan created YARN-3139: Summary: Improve locks in AbstractYarnScheduler/CapacityScheduler/FairScheduler Key: YARN-3139 URL: https://issues.apache.org/jira/browse/YARN-3139 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Wangda Tan Assignee: Li Lu Enhance locks in AbstractYarnScheduler/CapacityScheduler/FairScheduler; as mentioned in YARN-3091, a possible solution is using a read/write lock. Other fine-grained locks for specific purposes / bugs should be addressed in separate tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
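A minimal sketch of the read/write-lock pattern mentioned above, with placeholder members rather than the real scheduler API: frequent read-only queries can proceed concurrently, while mutations take the exclusive lock.
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SchedulerLockingSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private int numNodes;   // placeholder for scheduler state

  // Read path: many callers may hold the read lock at once.
  int getNumClusterNodes() {
    lock.readLock().lock();
    try {
      return numNodes;
    } finally {
      lock.readLock().unlock();
    }
  }

  // Write path: node add/remove takes the exclusive write lock.
  void addNode() {
    lock.writeLock().lock();
    try {
      numNodes++;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}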
[jira] [Created] (YARN-3124) Capacity Scheduler LeafQueue/ParentQueue should use QueueCapacities to track capacities-by-label
Wangda Tan created YARN-3124: Summary: Capacity Scheduler LeafQueue/ParentQueue should use QueueCapacities to track capacities-by-label Key: YARN-3124 URL: https://issues.apache.org/jira/browse/YARN-3124 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-281) Add a test for YARN Schedulers' MAXIMUM_ALLOCATION limits
[ https://issues.apache.org/jira/browse/YARN-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-281. - Resolution: Won't Fix Release Note: I think this may not be needed since we already have tests in TestSchedulerUtils that verify minimum/maximum resource normalization/verification, and SchedulerUtils runs before the scheduler can see such resource requests. Resolving as Won't Fix. Add a test for YARN Schedulers' MAXIMUM_ALLOCATION limits - Key: YARN-281 URL: https://issues.apache.org/jira/browse/YARN-281 Project: Hadoop YARN Issue Type: Test Components: scheduler Affects Versions: 2.0.0-alpha Reporter: Harsh J Assignee: Wangda Tan Labels: test We currently have tests that test MINIMUM_ALLOCATION limits for FifoScheduler and the like, but no test for MAXIMUM_ALLOCATION yet. We should add a test to prevent regressions of any kind on such limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3153) Capacity Scheduler max AM resource percentage is mis-used as ratio
Wangda Tan created YARN-3153: Summary: Capacity Scheduler max AM resource percentage is mis-used as ratio Key: YARN-3153 URL: https://issues.apache.org/jira/browse/YARN-3153 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Priority: Critical The existing Capacity Scheduler can limit the AM resource (and thus the maximum number of running applications) within a queue. The config is yarn.scheduler.capacity.maximum-am-resource-percent, but it is actually used as a ratio: the implementation assumes the input is in [0,1]. So a user can currently specify values up to 100, which lets AMs use 100x of the queue capacity. We should fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
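A minimal sketch of the validation the fix calls for; the property name comes from the report, while the helper class itself is hypothetical:
{code}
// Reject configured values outside [0,1] instead of silently multiplying
// by them. The helper is illustrative, not the actual CS config code.
final class AmResourcePercentCheck {
  static final String PROP =
      "yarn.scheduler.capacity.maximum-am-resource-percent";

  static float validate(float configured) {
    if (configured < 0f || configured > 1f) {
      throw new IllegalArgumentException(
          PROP + " must be in [0,1], but was " + configured);
    }
    return configured;
  }

  public static void main(String[] args) {
    System.out.println(validate(0.1f)); // ok: AMs may use 10% of the queue
    // validate(100f) would throw: 100 would allow 100x of queue capacity
  }
}
{code}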
[jira] [Created] (YARN-3132) RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated
Wangda Tan created YARN-3132: Summary: RMNodeLabelsManager should remove node from node-to-label mapping when node becomes deactivated Key: YARN-3132 URL: https://issues.apache.org/jira/browse/YARN-3132 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Using an example to explain: 1) Admin specifies host1 has label=x 2) node=host1:123 registers 3) Get node-to-label mapping returns host1/host1:123 4) node=host1:123 unregisters 5) Get node-to-label mapping still returns host1:123 We should probably remove host1:123 when it becomes deactivated and no label is directly assigned to it (direct assignment means the admin specified host1:123 has x, rather than host1 has x). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
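A minimal sketch of the proposed cleanup rule, using plain strings for node ids and hypothetical names throughout:
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// On deactivation, drop host1:123 from the node-to-labels map unless a
// label was assigned to that exact host:port (rather than inherited from
// the host entry). All names are illustrative.
class NodeToLabelsSketch {
  private final Set<String> directlyLabeledNodes = new HashSet<>();
  private final Map<String, Set<String>> nodeToLabels = new HashMap<>();

  void onNodeDeactivated(String hostPort) {
    if (!directlyLabeledNodes.contains(hostPort)) {
      // The label came from the host-level entry; this per-node row is
      // stale once the node unregisters.
      nodeToLabels.remove(hostPort);
    }
  }
}
{code}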
[jira] [Created] (YARN-3213) Respect labels in Capacity Scheduler when computing user-limit
Wangda Tan created YARN-3213: Summary: Respect labels in Capacity Scheduler when computing user-limit Key: YARN-3213 URL: https://issues.apache.org/jira/browse/YARN-3213 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan We now support node-labels in the Capacity Scheduler, but the user-limit computation doesn't respect node-labels enough; we should fix that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3233) Implement scheduler common configuration parser and create abstraction layer in CapacityScheduler to support plain/hierarchy configuration.
Wangda Tan created YARN-3233: Summary: Implement scheduler common configuration parser and create abstraction layer in CapacityScheduler to support plain/hierarchy configuration. Key: YARN-3233 URL: https://issues.apache.org/jira/browse/YARN-3233 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3214) Adding non-exclusive node labels
Wangda Tan created YARN-3214: Summary: Adding non-exclusive node labels Key: YARN-3214 URL: https://issues.apache.org/jira/browse/YARN-3214 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Currently node labels partition the cluster into sub-clusters, so resources cannot be shared between the partitions. With the current implementation of node labels we cannot use the cluster optimally, and cluster throughput suffers. We propose adding non-exclusive node labels: 1. Labeled apps get preference on labeled nodes. 2. If there is no ask for labeled resources, we can assign those nodes to non-labeled apps. 3. If there is any future ask for those resources, we will preempt the non-labeled apps and give the resources back to labeled apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3216) Max-AM-Resource-Percentage should respect node labels
Wangda Tan created YARN-3216: Summary: Max-AM-Resource-Percentage should respect node labels Key: YARN-3216 URL: https://issues.apache.org/jira/browse/YARN-3216 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3026) Move application-specific container allocation logic from LeafQueue to FiCaSchedulerApp
Wangda Tan created YARN-3026: Summary: Move application-specific container allocation logic from LeafQueue to FiCaSchedulerApp Key: YARN-3026 URL: https://issues.apache.org/jira/browse/YARN-3026 Project: Hadoop YARN Issue Type: Task Components: capacityscheduler Reporter: Wangda Tan Assignee: Wangda Tan Had a discussion with [~vinodkv] and [~jianhe]: in the existing Capacity Scheduler, all allocation logic of and under LeafQueue is located in LeafQueue.java. To give LeafQueue a cleaner scope, we'd better move some of it to FiCaSchedulerApp. The ideal scope of LeafQueue: it receives some resources from its ParentQueue (like 15% of cluster resource) and distributes those resources to its child apps, while staying agnostic to the apps' internal logic (like delay scheduling, etc.). In other words, LeafQueue shouldn't decide how an application allocates containers from the given resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3016) (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager
Wangda Tan created YARN-3016: Summary: (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager Key: YARN-3016 URL: https://issues.apache.org/jira/browse/YARN-3016 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan We now have separate but similar implementations for add/remove/replace labels on nodes in CommonNodeLabelsManager; we should merge them into a single method for easier modification and better readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
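A minimal sketch of what the merged method could look like, assuming a hypothetical operation enum; the real CommonNodeLabelsManager signatures may differ:
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// One internal method parameterized by the operation, instead of three
// near-identical copies. Enum and signature are illustrative.
class LabelOpsSketch {
  enum Op { ADD, REMOVE, REPLACE }

  private final Map<String, Set<String>> labelsOnNode = new HashMap<>();

  void internalUpdateLabelsOnNode(String node, Set<String> labels, Op op) {
    Set<String> current =
        labelsOnNode.computeIfAbsent(node, k -> new HashSet<>());
    switch (op) {
      case ADD:     current.addAll(labels); break;
      case REMOVE:  current.removeAll(labels); break;
      case REPLACE: current.clear(); current.addAll(labels); break;
    }
    // Shared post-processing (persisting to the store, firing events, ...)
    // then lives here once rather than in three places.
  }
}
{code}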
[jira] [Created] (YARN-3014) Changing labels on a host should update all NM's labels on that host
Wangda Tan created YARN-3014: Summary: Changing labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3234) Add changes in CapacityScheduler to use the abstracted configuration layer
Wangda Tan created YARN-3234: Summary: Add changes in CapacityScheduler to use the abstracted configuration layer Key: YARN-3234 URL: https://issues.apache.org/jira/browse/YARN-3234 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3235) Support uniformed scheduler configuration in FairScheduler
Wangda Tan created YARN-3235: Summary: Support uniformed scheduler configuration in FairScheduler Key: YARN-3235 URL: https://issues.apache.org/jira/browse/YARN-3235 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3354) Container should contains node-labels asked by original ResourceRequests
Wangda Tan created YARN-3354: Summary: Container should contains node-labels asked by original ResourceRequests Key: YARN-3354 URL: https://issues.apache.org/jira/browse/YARN-3354 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, nodemanager, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan We proposed non-exclusive node labels in YARN-3214, which lets non-labeled resource requests be allocated on labeled nodes that have idle resources. To make preemption work, we need to know an allocated container's original node label: when labeled resource requests come back, we need to kill non-labeled containers running on labeled nodes. This requires adding node-labels to Container; also, the NM needs to store this information and send it back to the RM on RM restart so the original container can be recovered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3356) Capacity Scheduler LeafQueue.User/FiCaSchedulerApp should use ResourceUsage to track used-resources-by-label.
Wangda Tan created YARN-3356: Summary: Capacity Scheduler LeafQueue.User/FiCaSchedulerApp should use ResourceUsage to track used-resources-by-label. Key: YARN-3356 URL: https://issues.apache.org/jira/browse/YARN-3356 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Similar to YARN-3099, Capacity Scheduler's LeafQueue.User/FiCaSchedulerApp should use ResourceUsage to track used/pending resources by label, for better resource tracking and preemption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3361) CapacityScheduler side changes to support non-exclusive node labels
Wangda Tan created YARN-3361: Summary: CapacityScheduler side changes to support non-exclusive node labels Key: YARN-3361 URL: https://issues.apache.org/jira/browse/YARN-3361 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Per the design doc attached to YARN-3214, these are the CapacityScheduler-side changes to support non-exclusive node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3362) Add node label usage in RM CapacityScheduler web UI
Wangda Tan created YARN-3362: Summary: Add node label usage in RM CapacityScheduler web UI Key: YARN-3362 URL: https://issues.apache.org/jira/browse/YARN-3362 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, webapp Reporter: Wangda Tan We don't show node label usage in the RM CapacityScheduler web UI now; without it, it is hard for users to understand what is happening on nodes that have labels assigned to them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3383) AdminService should use warn instead of info to log exception when operation fails
Wangda Tan created YARN-3383: Summary: AdminService should use warn instead of info to log exception when operation fails Key: YARN-3383 URL: https://issues.apache.org/jira/browse/YARN-3383 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Wangda Tan Now it uses info: {code} private YarnException logAndWrapException(IOException ioe, String user, String argName, String msg) throws YarnException { LOG.info("Exception " + msg, ioe); {code} But it should use warn instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3298) User-limit should be enforced in CapacityScheduler
Wangda Tan created YARN-3298: Summary: User-limit should be enforced in CapacityScheduler Key: YARN-3298 URL: https://issues.apache.org/jira/browse/YARN-3298 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, yarn Reporter: Wangda Tan Assignee: Wangda Tan User-limit is not treated as a hard limit for now: it does not consider the required resource (the resource of the request being allocated), and even when a user's used resource equals the user-limit, allocation still continues. This will cause jitter once we have YARN-2069 (the preemption policy kills a container under a user, and the scheduler allocates a container under the same user soon after). The expected behavior should be the same as for queue capacity: only when user.usage + required <= user-limit should the queue continue to allocate containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
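A minimal sketch of the hard-limit check described above, with illustrative names and memory-only resources:
{code}
// Only allocate when the user's usage plus the container being allocated
// stays within the user limit, mirroring queue-capacity enforcement.
final class UserLimitCheck {
  static boolean canAssign(long userUsedMb, long requiredMb, long userLimitMb) {
    // Counting the required resource up front avoids the
    // preempt-then-reallocate jitter described with YARN-2069.
    return userUsedMb + requiredMb <= userLimitMb;
  }

  public static void main(String[] args) {
    System.out.println(canAssign(900, 100, 1000)); // true: exactly at limit
    System.out.println(canAssign(950, 100, 1000)); // false: would exceed it
  }
}
{code}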
[jira] [Created] (YARN-3340) Mark setters to be @Public for ApplicationId/ApplicationAttemptId/ContainerId.
Wangda Tan created YARN-3340: Summary: Mark setters to be @Public for ApplicationId/ApplicationAttemptId/ContainerId. Key: YARN-3340 URL: https://issues.apache.org/jira/browse/YARN-3340 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Priority: Blocker Currently, the setters of ApplicationId/ApplicationAttemptId/ContainerId are all private; that's not correct -- users' applications need to set such ids to query status / submit applications, etc. We need to mark such setters as public so downstream applications don't hit compilation errors when changes are made to these setters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3346) Deadlock in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-3346. -- Resolution: Implemented Fix Version/s: 2.6.1 2.7.0 This issue is already resolved: YARN-3251 for the 2.6.1 fix, and YARN-3265 for the 2.7.0 fix. Deadlock in Capacity Scheduler -- Key: YARN-3346 URL: https://issues.apache.org/jira/browse/YARN-3346 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Suma Shivaprasad Fix For: 2.7.0, 2.6.1 Attachments: rm.deadlock_jstack {noformat} Found one Java-level deadlock: "2144051991@qtp-383501499-6": waiting to lock monitor 0x7fa700eec8e8 (object 0x0004589fec18, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp), which is held by "ResourceManager Event Processor" "ResourceManager Event Processor": waiting to lock monitor 0x7fa700aadf88 (object 0x000441c05ec8, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue), which is held by "IPC Server handler 0 on 54311" "IPC Server handler 0 on 54311": waiting to lock monitor 0x7fa700e20798 (object 0x000441d867f8, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue), which is held by "ResourceManager Event Processor" {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3345) Add non-exclusive node label RMAdmin CLI/API
Wangda Tan created YARN-3345: Summary: Add non-exclusive node label RMAdmin CLI/API Key: YARN-3345 URL: https://issues.apache.org/jira/browse/YARN-3345 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan As described in YARN-3214 (see the design doc attached to that JIRA), we need to add the non-exclusive node label RMAdmin API and CLI implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3279) AvailableResource of QueueMetrics should consider queue's current-max-limit
Wangda Tan created YARN-3279: Summary: AvailableResource of QueueMetrics should consider queue's current-max-limit Key: YARN-3279 URL: https://issues.apache.org/jira/browse/YARN-3279 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Now the queue's available resource doesn't consider the queue's current-max-limit, but the user's available resource already does; we should make them consistent. In addition, we can organize the code better: the computation of AvailableResource for QueueMetrics/UserMetrics currently lives in two places, and we should merge it into one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
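A minimal sketch of the consolidation suggested above, assuming a single hypothetical helper shared by queue and user metrics:
{code}
// "Available" is bounded by the queue's current max limit, not its full
// cluster share; both QueueMetrics and UserMetrics would call this one
// helper so the two stay consistent. Names are illustrative.
final class AvailableResourceSketch {
  static long availableMb(long currentMaxLimitMb, long usedMb) {
    return Math.max(0, currentMaxLimitMb - usedMb);
  }

  public static void main(String[] args) {
    // Queue at 40 GB used with a 50 GB current max limit -> 10 GB available.
    System.out.println(availableMb(50 * 1024, 40 * 1024));
  }
}
{code}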
[jira] [Created] (YARN-3277) Queue's current-max-limit should be updated before allocate reserved container.
Wangda Tan created YARN-3277: Summary: Queue's current-max-limit should be updated before allocate reserved container. Key: YARN-3277 URL: https://issues.apache.org/jira/browse/YARN-3277 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan This is introduced by the changes in YARN-3265. With YARN-2008, when the RM allocates a reserved container, it goes to the LeafQueue directly, then walks up to the root to get the LeafQueue's current-max-limit correct. Now we no longer walk up, so the LeafQueue cannot get its maxQueueLimit updated before allocating a reserved container. One possible solution: still start from the root when allocating a reserved container, but descend top-to-bottom to the LeafQueue the reserved container belongs to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
Wangda Tan created YARN-3243: Summary: CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits. Key: YARN-3243 URL: https://issues.apache.org/jira/browse/YARN-3243 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan CapacityScheduler currently has some issues ensuring a ParentQueue always obeys its capacity limits, for example: 1) When allocating a container under a parent queue, it only checks parentQueue.usage < parentQueue.max. If a leaf queue allocates a container of size > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max resource limit, as in the following example: {code} A (usage=54, max=55) / \ A1 A2 (usage=1, max=55) (usage=53, max=53) {code} Queue-A1 is able to allocate a container since its usage < max, but if we do that, A's usage can exceed A.max. 2) During the continuous-reservation check, the parent queue only tells its children "you need to unreserve *some* resource so that I stay below my maximum", but not how much resource needs to be unreserved. This may also lead the parent queue to exceed its configured maximum capacity. With YARN-3099/YARN-3124, we now have a {{ResourceUsage}} object in each class, so *here is my proposal*: - ParentQueue will set its children's ResourceUsage.headroom, which means the *maximum resource its children can allocate*. - ParentQueue will set each child's headroom to (saying the parent's name is qA): min(qA.headroom, qA.max - qA.used). This makes sure qA's ancestors' capacities are enforced as well (qA.headroom is set by qA's parent). - {{needToUnReserve}} is then unnecessary; instead, children can compute how much resource needs to be unreserved to keep their parent within its resource limit. - Moreover, with this, YARN-3026 can draw a clear boundary between LeafQueue and FiCaSchedulerApp; headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
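A minimal sketch of the headroom-propagation proposal, with an illustrative queue tree rather than the real CapacityScheduler classes:
{code}
import java.util.ArrayList;
import java.util.List;

// Each parent pushes down headroom = min(its own headroom, max - used),
// so every ancestor's limit is enforced at the leaves. Illustrative only.
class QueueNode {
  long usedMb;
  long maxMb;
  long headroomMb = Long.MAX_VALUE;  // set by the parent; root is uncapped
  final List<QueueNode> children = new ArrayList<>();

  void propagateHeadroom() {
    long myHeadroom = Math.min(headroomMb, maxMb - usedMb);
    for (QueueNode child : children) {
      child.headroomMb = myHeadroom;  // child may allocate at most this much
      child.propagateHeadroom();
    }
  }
}
{code}
Applied to the example above, A would push headroom = 55 - 54 = 1 down to A1 and A2, so A1 could allocate at most 1 more unit despite its own max of 55.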
[jira] [Resolved] (YARN-3003) Provide API for client to retrieve label to node mapping
[ https://issues.apache.org/jira/browse/YARN-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-3003. -- Resolution: Duplicate [~varun_saxena], thanks for the reminder. I just reopened and then resolved this as Duplicate: since the patch for this JIRA was split across two other JIRAs, no code was actually committed for this one. Provide API for client to retrieve label to node mapping Key: YARN-3003 URL: https://issues.apache.org/jira/browse/YARN-3003 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Ted Yu Assignee: Varun Saxena Attachments: YARN-3003.001.patch, YARN-3003.002.patch Currently YarnClient#getNodeToLabels() returns the mapping from NodeId to the set of labels associated with the node. A client (such as Slider) may be interested in the label-to-node mapping: given a label, return the nodes with this label. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3265) CapacityScheduler deadlock when computing absolute max avail capacity (fix for trunk/branch-2)
Wangda Tan created YARN-3265: Summary: CapacityScheduler deadlock when computing absolute max avail capacity (fix for trunk/branch-2) Key: YARN-3265 URL: https://issues.apache.org/jira/browse/YARN-3265 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan This patch solves the same problem described in YARN-3251, but as a longer-term fix for trunk and branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3251) Fix CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)
[ https://issues.apache.org/jira/browse/YARN-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-3251. -- Resolution: Fixed Fix Version/s: 2.6.1 Hadoop Flags: Reviewed Just compiled and ran all tests in CapacityScheduler; committed to branch-2.6. Thanks [~cwelch], and thanks for the reviews from [~jlowe], [~sunilg] and [~vinodkv]. Fix CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1) Key: YARN-3251 URL: https://issues.apache.org/jira/browse/YARN-3251 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Craig Welch Priority: Blocker Fix For: 2.6.1 Attachments: YARN-3251.1.patch, YARN-3251.2-6-0.2.patch, YARN-3251.2-6-0.3.patch, YARN-3251.2-6-0.4.patch, YARN-3251.2.patch The ResourceManager can deadlock in the CapacityScheduler when computing the absolute max available capacity for user limits and headroom. -- This message was sent by Atlassian JIRA (v6.3.4#6332)