[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110340#comment-14110340 ] Tsuyoshi OZAWA commented on YARN-2452: -- Thanks for your contribution, [~zxu]. This is just my guess, but I think some tests depend on CapacityScheduler. Should we fix all of them? About the patch itself, it's better to use FairSchedulerConfiguration.ASSIGN_MULTIPLE instead of hard-coding the property. {code} +conf.setBoolean("yarn.scheduler.fair.assignmultiple", true); {code} TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
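For illustration, the suggested change amounts to something like the following in the test setup. This is a sketch only; it assumes FairSchedulerConfiguration.ASSIGN_MULTIPLE is visible to the test, which is exactly what the next comments discuss:
{code}
// Hypothetical test setup: use the named constant instead of the literal string.
Configuration conf = new YarnConfiguration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    FairScheduler.class, ResourceScheduler.class);
conf.setBoolean(FairSchedulerConfiguration.ASSIGN_MULTIPLE, true);
{code}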
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110354#comment-14110354 ] Beckham007 commented on YARN-810: - Hi, [~ywskycn] and [~vvasudev]. Both this issue and YARN-2440 are doing CPU core isolation for containers. In our production cluster, if the number of vcores is more than pcores, the NM will crash (the system processes couldn't get CPU time). So these issues are worthwhile. But using cfs_quota_us and cfs_period_us makes too many changes in the LCE, even though we have modified ContainerLaunch. I think cpu/memory/diskio could be first-class for resource isolation, and cfs_quota_us and cfs_period_us should be second. I also think we should refactor the LCE to support more cgroups subsystems, as in YARN-2139 and YARN-2140. In that case, we could use cpuset for CPU core isolation. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN.
First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 100000 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1
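To make the proposed enforcement concrete, here is a rough sketch of how an NM could derive a hard cap from a container's vcore allocation. The variable names and the 1:4 pcore:vcore ratio are assumptions for illustration, not the attached patch:
{code}
// Illustrative only: derive a CFS hard cap for one container.
// The kernel's default period is 100000us (0.1s); quotas below 1000us are rejected.
long periodUs = 100000L;
int containerVcores = 1;  // the container's requested vcores (assumed)
int vcoresPerPcore = 4;   // the configured pcore:vcore ratio (assumed)
long quotaUs = Math.max(1000L, periodUs * containerVcores / vcoresPerPcore);
// Writing quotaUs to the container cgroup's cpu.cfs_quota_us enforces the ceiling;
// writing -1 restores the default soft-share behavior shown above.
{code}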
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110366#comment-14110366 ] zhihai xu commented on YARN-2452: - [~Tsuyoshi OZAWA] thanks for the review. I tried to use FairSchedulerConfiguration.ASSIGN_MULTIPLE at the beginning, but I got a compilation error because ASSIGN_MULTIPLE is protected and can't be accessed by the test. {code} protected static final String ASSIGN_MULTIPLE = CONF_PREFIX + "assignmultiple"; {code} Can I change protected to public in the above code? TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
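For reference, the change being asked about would be a one-line visibility tweak in FairSchedulerConfiguration (assuming no other constraints on the constant's visibility):
{code}
public static final String ASSIGN_MULTIPLE = CONF_PREFIX + "assignmultiple";
{code}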
[jira] [Created] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
Xu Yang created YARN-2454: - Summary: The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110378#comment-14110378 ] Tsuyoshi OZAWA commented on YARN-2452: -- Thanks for your explanation. I don't know why this property is protected. [~kkambatl], [~sandyr], can we make FairSchedulerConfiguration.ASSIGN_MULTIPLE public? Or shouldn't we do that? TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2453: Attachment: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler: {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
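Presumably the fix is to pin the scheduler class in the test configuration so the SchedulingMonitor service is actually created. A minimal sketch of that idea (not necessarily the attached patch):
{code}
// Only PreemptableResourceScheduler implementations (e.g. CapacityScheduler)
// cause the RM to create the SchedulingMonitor service.
Configuration conf = new Configuration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
conf.setBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{code}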
[jira] [Assigned] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA reassigned YARN-2454: Assignee: Tsuyoshi OZAWA The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang Assignee: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110388#comment-14110388 ] Beckham007 commented on YARN-2454: -- The compareTo() of Resource UNBOUNDED is copied from Resource NONE. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang Assignee: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2455) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong.
Xu Yang created YARN-2455: - Summary: The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong. Key: YARN-2455 URL: https://issues.apache.org/jira/browse/YARN-2455 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
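Concretely, the description implies a fix along these lines inside the UNBOUNDED singleton (a sketch of the idea, not the attached patch):
{code}
// UNBOUNDED should compare using its own maximal values rather than NONE's zeros.
@Override
public int compareTo(Resource o) {
  int diff = Integer.MAX_VALUE - o.getMemory();      // was: 0 - o.getMemory()
  if (diff == 0) {
    diff = Integer.MAX_VALUE - o.getVirtualCores();  // was: 0 - o.getVirtualCores()
  }
  return diff;
}
{code}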
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2454: - Labels: (was: patch) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Assignee: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2454: - Attachment: YARN-2454-patch.diff The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Assignee: Tsuyoshi OZAWA Attachments: YARN-2454-patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2454: - Assignee: (was: Tsuyoshi OZAWA) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Description: The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110402#comment-14110402 ] Tsuyoshi OZAWA commented on YARN-2454: -- Thanks for your contribution, [~beckham007]. The fix itself looks good to me. How about adding tests to TestResources like this? {code}
@Test(timeout=1000)
public void testCompareToWithUnboundedResource() {
  assertTrue(Resources.unbounded().compareTo(
      createResource(Integer.MAX_VALUE, Integer.MAX_VALUE)) == 0);
  assertTrue(Resources.unbounded().compareTo(
      createResource(Integer.MAX_VALUE, 0)) < 0);
  assertTrue(Resources.unbounded().compareTo(
      createResource(0, Integer.MAX_VALUE)) < 0);
}

@Test(timeout=1000)
public void testCompareToWithNoneResource() {
  assertTrue(Resources.none().compareTo(createResource(0, 0)) == 0);
  assertTrue(Resources.none().compareTo(
      createResource(1, 0)) > 0);
  assertTrue(Resources.none().compareTo(
      createResource(0, 1)) > 0);
}
{code} The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110404#comment-14110404 ] Tsuyoshi OZAWA commented on YARN-2454: -- sorry, testCompareToWithNoneResource is wrong. A fixed version is as follows: {code}
@Test(timeout=1000)
public void testCompareToWithNoneResource() {
  assertTrue(Resources.none().compareTo(createResource(0, 0)) == 0);
  assertTrue(Resources.none().compareTo(
      createResource(1, 0)) < 0);
  assertTrue(Resources.none().compareTo(
      createResource(0, 1)) < 0);
}
{code} The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110410#comment-14110410 ] Beckham007 commented on YARN-2454: -- Hi, [~ozawa]. It could be assertTrue(Resources.unbounded().compareTo( createResource(Integer.MAX_VALUE, 0)) > 0) ? The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2455) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong.
[ https://issues.apache.org/jira/browse/YARN-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang resolved YARN-2455. --- Resolution: Duplicate Looks like the same issue as YARN-2454. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong. Key: YARN-2455 URL: https://issues.apache.org/jira/browse/YARN-2455 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110413#comment-14110413 ] Tsuyoshi OZAWA commented on YARN-2454: -- Oops, you're right. Could you update it? The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110414#comment-14110414 ] Wei Yan commented on YARN-2454: --- One more thing: the NONE in these messages also needs to be updated to UNBOUNDED. {code}
@Override
public void setMemory(int memory) {
  throw new RuntimeException("NONE cannot be modified!");
}

@Override
public void setVirtualCores(int cores) {
  throw new RuntimeException("NONE cannot be modified!");
}
{code} The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
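In other words, the UNBOUNDED setters were presumably copy-pasted from NONE, so a corrected version would read (sketch, not the committed patch):
{code}
@Override
public void setMemory(int memory) {
  throw new RuntimeException("UNBOUNDED cannot be modified!");
}

@Override
public void setVirtualCores(int cores) {
  throw new RuntimeException("UNBOUNDED cannot be modified!");
}
{code}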
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110423#comment-14110423 ] Beckham007 commented on YARN-2454: -- I have talked with [~yxls123123], he will update this patch. Thanks~ The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110427#comment-14110427 ] Hadoop QA commented on YARN-2453: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664328/YARN-2453.000.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4731//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4731//console This message is automatically generated. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler: {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110426#comment-14110426 ] Hadoop QA commented on YARN-2454: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664331/YARN-2454-patch.diff against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4732//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4732//console This message is automatically generated. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Attachment: YARN-2454.patch Thank you for your suggestions, Beckham007, Tsuyoshi OZAWA and Wei Yan. I fixed it and added two tests. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110472#comment-14110472 ] Hadoop QA commented on YARN-2454: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664350/YARN-2454.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4733//console This message is automatically generated. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110483#comment-14110483 ] Tsuyoshi OZAWA commented on YARN-2454: -- [~yxls123123], please generate your patch at the root directory of the source code. Additional minor nits: how about moving the tests to org.apache.hadoop.yarn.server.resource.TestResources instead of adding a new file? The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
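For reference, one common way to produce such a patch from the top of the tree (assuming a git checkout; exact commands vary by setup):
{noformat}
$ cd hadoop        # repository root (path assumed)
$ git diff > YARN-2454-v2.patch
{noformat}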
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110525#comment-14110525 ] Xu Yang commented on YARN-2454: --- [~te...@uproadx.com], thank you for your suggestion. I generated a new patch at the root directory. About moving these tests to org.apache.hadoop.yarn.server.resourcemanager.resource.Resources, I think that would be strange; moving the latter into the file I created seems a better way. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Attachment: YARN-2454 -v2.patch The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Attachment: (was: YARN-2454 .patch) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110555#comment-14110555 ] Varun Vasudev commented on YARN-2440: - [~jlowe] the example provided by [~sjlee0] is the one I wanted to address when I added support for both percentage and absolute cores. Would it make more sense if I picked the lower value instead of one overriding the other? Something like: 1. Evaluate the cores allowed by yarn.nodemanager.containers-cpu-cores and yarn.nodemanager.containers-cpu-percentage. 2. Pick the lower of the two values. 3. Log a warning/info message that both were specified and that we're picking the lower value. {quote} I'm not thrilled about the name template containers-cpu-* since it could easily be misinterpreted as a per-container thing as well, but I'm currently at a loss for a better prefix. Suggestions welcome. {quote} How about yarn.nodemanager.all-containers-cpu-cores and yarn.nodemanager.all-containers-cpu-percentage? {quote} Does getOverallLimits need to check for a quotaUS that's too low as well? {quote} Thanks for catching this; I'll fix it in the next patch. {quote} I think minimally we need to log a warning if we're going to ignore setting up cgroups to limit CPU usage across all containers if the user specified to do so. {quote} I'll add in the logging message. {quote} Related to the previous comment, I think it would be nice if we didn't try to setup any limits if none were specified. That way if there's some issue with correctly determining the number of cores on a particular system it can still work in the default, use everything scenario. {quote} Will do. {quote} NodeManagerHardwareUtils.getContainerCores should be getContainersCores (the per-container vs. all-containers confusion again) {quote} I'll rename the function. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
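Sketched out, the reconciliation proposed in steps 1-3 would look roughly like this (the property names are the ones under discussion and may still change; conf is the NM Configuration, LOG a standard logger, and nodeCores the detected hardware core count):
{code}
// Take the more restrictive of the two limits and warn when they disagree.
int coresFromCount = conf.getInt("yarn.nodemanager.containers-cpu-cores", nodeCores);
int coresFromPercent = (int) (nodeCores
    * conf.getInt("yarn.nodemanager.containers-cpu-percentage", 100) / 100.0f);
int containersCores = Math.min(coresFromCount, coresFromPercent);
if (coresFromCount != coresFromPercent) {
  LOG.warn("Both CPU limits were specified; using the lower value: " + containersCores);
}
{code}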
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110556#comment-14110556 ] Varun Vasudev commented on YARN-2440: - [~beckham007] the current implementation of Cgroups uses cpu instead of cpuset, probably due to the flexibility offered (sharing the cores is handled by the kernel). Is there any particular benefit to cpuset? Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110566#comment-14110566 ] Beckham007 commented on YARN-2440: -- Hi, [~vvasudev]. “Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks.” https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt For an NM that has 24 pcores, we can use the cpuset subsystem to make hadoop-yarn use CPU cores 0-21 and leave the others (22, 23) for the system. Then cpu.shares can be used to share pcores 0-21. What's more, we can assign a pcore (such as core 21) to run a long-running container, while other containers only share pcores 0-20. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
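In cgroup terms, the layout described above would look something like this (illustrative only; paths assume a /cgroup mount, and cpuset.mems must also be set before tasks can be added to a cpuset):
{noformat}
echo 0-21 > /cgroup/cpuset/hadoop-yarn/cpuset.cpus    # containers share cores 0-21
echo 22-23 > /cgroup/cpuset/system/cpuset.cpus        # cores reserved for system daemons
{noformat}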
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110567#comment-14110567 ] Beckham007 commented on YARN-2440: -- In addition, Mesos uses cpuset by default. https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110571#comment-14110571 ] Hadoop QA commented on YARN-2454: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664364/YARN-2454%20-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4734//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4734//console This message is automatically generated. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110815#comment-14110815 ] Eric Payne commented on YARN-415: - bq. -1 release audit. The applied patch generated 3 release audit warnings. Files triggering audit warnings not part of this patch: {{EncryptionFaultInjector.java}}, {{EncryptionZoneManager.java }}, and {{EncryptionZoneWithId.java}} {quote} -1 core tests. The patch failed these unit tests org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {quote} This test failure is intermittent and does not seem to be caused by this patch. Please see: https://builds.apache.org/job/PreCommit-YARN-Build/4711/ https://builds.apache.org/job/PreCommit-YARN-Build/4727/ [~jianhe] and [~kkambatl], I really appreciate all of your help in reviewing this patch and making it better with your suggestions. How close are we to getting this patch submitted? Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
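As a quick worked example of the chargeback formula in the description (the numbers are invented for illustration):
{code}
// container 1: 2048 MB reserved for 600 s; container 2: 1024 MB reserved for 300 s
long memorySeconds = 2048L * 600 + 1024L * 300;  // = 1536000 MB-seconds for the app
{code}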
[jira] [Commented] (YARN-2385) Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110849#comment-14110849 ] Sunil G commented on YARN-2385: --- Yes [~subru], _moveAllApps_ is also using this api. bq. If not maybe we should defer the splitting till we have a concrete use case? Now the behavior of *getAppsInQueue* in _killAllAppsInQueue_, _getApplications_ and _moveAllApps_ is different with Capacity Scheduler and Fair Scheduler. Hence I feel that we can split up and make a uniform behavior in all these caller sides. How do you feel? Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler Currently getAppsinQueue returns both pending running apps. The purpose of the JIRA is to explore splitting it to getRunningAppsInQueue + getPendingAppsInQueue that will provide more flexibility to callers -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110899#comment-14110899 ] Zhijie Shen commented on YARN-2102: --- Remove the last patch, as the method doesn't work given the group in the reader/writer list. More generalized timeline ACLs -- Key: YARN-2102 URL: https://issues.apache.org/jira/browse/YARN-2102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: GeneralizedTimelineACLs.pdf, YARN-2102.1.patch, YARN-2102.2.patch, YARN-2102.3.patch We need to differentiate the access controls of reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs, which writing the timeline data of this workflow, we don't want other user to pollute the timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Attachment: (was: YARN-2102.4.patch) More generalized timeline ACLs -- Key: YARN-2102 URL: https://issues.apache.org/jira/browse/YARN-2102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: GeneralizedTimelineACLs.pdf, YARN-2102.1.patch, YARN-2102.2.patch, YARN-2102.3.patch We need to differentiate the access controls of reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs, which writing the timeline data of this workflow, we don't want other user to pollute the timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2452: Attachment: YARN-2452.001.patch TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch, YARN-2452.001.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110968#comment-14110968 ] zhihai xu commented on YARN-2452: - I uploaded a new patch YARN-2452.001.patch. It splits testRMWritingMassiveHistory into two tests: testRMWritingMassiveHistoryForFairSche and testRMWritingMassiveHistoryForCapacitySche, one for the fair scheduler and one for the capacity scheduler. So we can test both schedulers. TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch, YARN-2452.001.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
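The described split would look roughly like this (the test method names are from the comment above; the shared body is assumed to accept a scheduler-specific configuration):
{code}
@Test
public void testRMWritingMassiveHistoryForFairSche() throws Exception {
  Configuration conf = new YarnConfiguration();
  conf.setClass(YarnConfiguration.RM_SCHEDULER,
      FairScheduler.class, ResourceScheduler.class);
  // assignmultiple lets FairScheduler hand out several containers per heartbeat
  conf.setBoolean("yarn.scheduler.fair.assignmultiple", true);
  runMassiveHistoryTest(conf);  // assumed shared helper
}

@Test
public void testRMWritingMassiveHistoryForCapacitySche() throws Exception {
  Configuration conf = new YarnConfiguration();
  conf.setClass(YarnConfiguration.RM_SCHEDULER,
      CapacityScheduler.class, ResourceScheduler.class);
  runMassiveHistoryTest(conf);  // assumed shared helper
}
{code}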
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111015#comment-14111015 ] Junping Du commented on YARN-2033: -- [~zjshen], overall the patch looks good to me. I don't have many more comments, except the one below. Would you fix it? bq. First of all, the two queries are not duplicates: one reads the application entity, and the other reads the app attempt entity, and we previously distinguished ApplicationNotFoundException and ApplicationAttemptNotFoundException. It is always possible that App1 exists in the store with only attempt AppAttempt1 while the user looks up AppAttempt2. In this case, we know App1 is there but AppAttempt2 isn't, so we will throw ApplicationAttemptNotFoundException. If we really want to differentiate the two exceptions, we can still check the ApplicationAttempt first and check the Application afterwards (to see whether to throw ApplicationNotFoundException instead) if ApplicationAttemptNotFoundException gets thrown there. This is more efficient, as we only need to visit the DB once in most cases. Isn't that right? Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client-side interfaces as close to what we have today.
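A hedged sketch of the attempt-first lookup being suggested; ApplicationNotFoundException and ApplicationAttemptNotFoundException are real YARN exceptions, while the store accessors getApplicationAttemptEntity and getApplicationEntity are hypothetical stand-ins for the timeline-store reads discussed above:
{code}
// Check the app attempt first; only on a miss do a second read to decide
// which exception to surface. One store visit in the common case.
ApplicationAttemptReport getAppAttempt(ApplicationAttemptId attemptId)
    throws YarnException {
  ApplicationAttemptReport attempt = getApplicationAttemptEntity(attemptId); // hypothetical
  if (attempt != null) {
    return attempt; // common case: a single store read
  }
  ApplicationReport app = getApplicationEntity(attemptId.getApplicationId()); // hypothetical
  if (app == null) {
    throw new ApplicationNotFoundException(
        "Application " + attemptId.getApplicationId() + " not found");
  }
  throw new ApplicationAttemptNotFoundException(
      "Application attempt " + attemptId + " not found");
}
{code}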
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111073#comment-14111073 ] Hadoop QA commented on YARN-2452: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664426/YARN-2452.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4735//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4735//console This message is automatically generated.
[jira] [Updated] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2395: -- Attachment: YARN-2395-3.patch FairScheduler: Preemption timeout should be configurable per queue -- Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch, YARN-2395-2.patch, YARN-2395-3.patch, YARN-2395-3.patch Currently in the fair scheduler, the preemption logic considers fair share starvation only at the leaf queue level. This jira is created to implement it at the parent queue as well. It involves: 1. Making the check for fair share starvation and the amount of resource to preempt recursive, so that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues.
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111090#comment-14111090 ] Karthik Kambatla commented on YARN-2395: bq. Which means starvation at parent queues would not be detected and preemption at parent will not happen. Am I missing something? If a parent queue is starved, wouldn't at least one of the child queues starve? Is there a counterexample?
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1431#comment-1431 ] Karthik Kambatla commented on YARN-2395: I think I now get Ashwin's point. Ashwin - please correct me if I am wrong. If the parent queue has a timeout of 5 seconds and all the child queues have a timeout of 30 seconds, any preemption under the parent queue kicks in only after 30 seconds and not 5 seconds. I am not sure we can really do much in this case. It may be a case of misconfiguration that we want to warn about. But then again, if a leaf queue were later created under the parent, it would inherit these timeouts.
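A hedged sketch of the inheritance rule under discussion, assuming each queue keeps a per-queue fairSharePreemptionTimeout field that is -1 when unset; the field, accessor, and default names are illustrative:
{code}
// Effective timeout: a queue uses its own value when configured, otherwise
// it inherits from the nearest configured ancestor, else the global default.
long getEffectiveFairSharePreemptionTimeout(FSQueue queue) {
  long timeout = queue.getFairSharePreemptionTimeout(); // assumed: -1 when unset
  FSQueue parent = queue.getParent();
  while (timeout < 0 && parent != null) {
    timeout = parent.getFairSharePreemptionTimeout();
    parent = parent.getParent();
  }
  return timeout < 0 ? defaultFairSharePreemptionTimeout : timeout; // assumed default
}
{code}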
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1448#comment-1448 ] Ashwin Shankar commented on YARN-2395: -- [~kasha], bq. If a parent queue is starved, wouldn't at least one of the child queues starve? Not always. Here is an example. Queue hierarchy: root.lowPriorityLeaf - fair share = 10%; root.HighPriorityParent - fair share = 90%, fairSharePreemptionThreshold = 1; root.HighPriorityParent.child(1-10). Scenario: apps running in root.lowPriorityLeaf, root.HighPriorityParent.child1, root.HighPriorityParent.child2 (remember we now have fair share for active queues). The following situation is possible: root.lowPriorityLeaf: *usage = 55% demand = 55% fair share = 10%* root.HighPriorityParent.child1: *usage = 45% demand = 85% fair share = 45%* root.HighPriorityParent.child2: usage = 5% demand = 5% fair share = 45% In the above example, the low priority queue with fair share 10% is taking up 55% of the cluster, while HighPriorityParent.child1 needs 85% but can get only 45% through preemption, since that's its fair share. Another point is that HighPriorityParent.child2 has a fair share of 45%, but needs only 5%. *Note that both child1 and child2 are NOT starved, but HighPriorityParent is starved.* The use case is basically this: we want ALL 90% of the cluster resources to go to HighPriorityParent whenever it's needed by ANY of its children. We can do that by detecting starvation at the parent HighPriorityParent and preempting from lowPriorityLeaf.
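A hedged sketch of the recursive starvation check this example motivates; getResourceUsage and getFairShare come from FairScheduler's Schedulable interface, while the threshold accessor and the memory-only comparison are simplifying assumptions:
{code}
// A parent is starved when its aggregate usage is below its fair share
// (scaled by the preemption threshold), even if no single child is starved.
// Compares memory only, for brevity.
boolean isStarved(FSQueue queue) {
  float threshold = queue.getFairSharePreemptionThreshold(); // assumed accessor
  Resource usage = queue.getResourceUsage();
  Resource share = queue.getFairShare();
  return usage.getMemory() < share.getMemory() * threshold;
}

// Walk the hierarchy root-to-leaf and collect every starved queue, so the
// whole 90% can flow to HighPriorityParent in the example above.
void collectStarvedQueues(FSQueue queue, List<FSQueue> starved) {
  if (isStarved(queue)) {
    starved.add(queue);
  }
  for (FSQueue child : queue.getChildQueues()) {
    collectStarvedQueues(child, starved);
  }
}
{code}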
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111235#comment-14111235 ] Jian He commented on YARN-1372: --- [~adhoot], any thoughts on my last comments? bq. the same justFinishedContainers set can be used to return to AM and ack NMs? bq. I meant can we remove all the containers in NMContext for the application once we received the NodeHeartbeatResponse#getApplicationsToCleanup notification, instead of depending on expiration. bq. I meant, is it possible for an NM in the DECOMMISSIONED/LOST state to receive the newly added CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle them too. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, the NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM, the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once, since the AM may have pulled the container but the RM may die before notifying the NM about the pull.
[jira] [Created] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
Jian He created YARN-2456: - Summary: Possible deadlock in CapacityScheduler when RM is recovering apps Key: YARN-2456 URL: https://issues.apache.org/jira/browse/YARN-2456 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Consider this scenario: 1. RM is configured with a single queue and only one application can be active at a time. 2. Submit App1, which uses up the queue's whole capacity. 3. Submit App2, which remains pending. 4. Restart the RM. 5. App2 is recovered before App1, so App2 is added to the activeApplications list. Now App1 remains pending (because of the max-active-app limit). 6. All containers of App1 are then recovered when the NM registers, and they use up the whole queue capacity again. 7. Since the queue is full, App2 cannot proceed to allocate its AM container. 8. Meanwhile, App1 cannot proceed to become active because of the max-active-app limit.
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111471#comment-14111471 ] Jason Lowe commented on YARN-2440: -- For the case presented by [~sjlee0], the user has an 8 core system and wants to use at most 6 cores for YARN containers. That can be done by simply setting containers-cpu-percentage to 75. I don't see why we need a separate containers-cpu-cores parameter here, and I think it causes more problems than it solves, per my previous comment. If we only want to support whole-core granularity then I can see containers-cpu-cores as a better choice, but otherwise containers-cpu-percentage is more flexible. Also I don't see vcores being relevant for this JIRA. The way vcores map to physical cores is node-dependent, but apps ask for vcores in a node-independent fashion. IIUC this JIRA is focused on simply limiting the amount of CPU all YARN containers on the node can possibly use in aggregate. Changing the vcore-to-core ratio on the node will change how many containers the node might run simultaneously, but it shouldn't impact how much of the physical CPU the user wants reserved for non-container processes. On a related note, it's interesting to step back and see if this is really what most users will want in practice. If the intent is to ensure the NM, DN, and other system processes get enough CPU time, then I think a better approach is to put those system processes in a peer cgroup to the YARN containers cgroup and set their relative CPU shares accordingly. Then YARN containers can continue to use any spare CPU if desired (i.e.: no CPU fragmentation), but the system processes are guaranteed not to be starved out by the YARN containers. Some users may want a hard limit, which is why this feature would be useful for them, but I suspect most users will not want to leave spare CPU lying around when containers need it. bq. How about yarn.nodemanager.all-containers-cpu-cores and yarn.nodemanager.all-containers-cpu-percentage? I'm indifferent on adding all as a prefix. Something like yarn.nodemanager.containers-limit-cpu-percentage might make it more clear that this is a hard limit and CPUs can go idle even if containers are demanding more from the machine than this limit. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml.
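For concreteness, a hedged sketch of how a percentage-style limit could translate into the CFS knobs named in this thread (cpu.cfs_quota_us / cpu.cfs_period_us); the cgroup mount path is illustrative and the class is a stand-in for what a cgroups resource handler would do, not the actual patch:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Convert a containers-cpu-percentage style limit into a CFS ceiling.
// With 8 cores and 75%: quota = 8 * 100000 * 75 / 100 = 600000us per
// 100000us period, i.e. at most 6 cores' worth of CPU in aggregate.
public class CpuCeilingSketch {
  static void setCpuCeiling(int numCores, int percentage) throws IOException {
    final long periodUs = 100000L; // CFS default period
    long quotaUs = numCores * periodUs * percentage / 100L;
    Path cgroup = Paths.get("/sys/fs/cgroup/cpu/hadoop-yarn"); // illustrative mount
    Files.write(cgroup.resolve("cpu.cfs_period_us"),
        Long.toString(periodUs).getBytes(StandardCharsets.UTF_8));
    Files.write(cgroup.resolve("cpu.cfs_quota_us"),
        Long.toString(quotaUs).getBytes(StandardCharsets.UTF_8));
  }
}
{code}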
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111520#comment-14111520 ] Carlo Curino commented on YARN-1707: Wangda, thanks for the great feedback. You spotted a bunch of oddities that were there due to previous versions of the reservation system but are not needed anymore; I think the updated version is definitely cleaner. We address 1, 2 by: * moving addQueue and removeQueue to PlanQueue (as they were only invoked on instances of the subclass). * making uniform checks from within the PlanQueue for capacity 0, and throwing a uniform SchedulerConfigEditException * fixing the log, and making the logs more uniform We address 3, 6 by: * merging addCapacity and subtractCapacity into a single changeCapacity * making the checks against the range limits [0, 1] (this reduced code both in CS and ReservationQueue... good call!) We address 4 by: * renaming getReservableQueues() to getPlanQueues() Regarding 5: ReservationQueue#getQueueName * This is the result of our previous conversations with Vinod, Bikas, and Arun. The idea is that the user should not be aware of the fact that we use queues to implement reservations, and thus shouldn't see the name of the reservation queue listed in the UI, but rather the name of the parent PlanQueue. More precisely, we have options for the UI to show or hide the subqueues, but this differentiation is needed here to allow that: getQueueName for a ReservationQueue returns the parent, while getReservationQueueName() returns the actual local name. Regarding 7: DynamicQueueConf * We currently only assign capacity dynamically, but you can imagine this being extended in the future to set many more parameters for a queue (user-limit factors, max applications, etc.). The conf-based mechanism is future-proofing against this. Regarding 8: ParentQueue#setChildQueues * I don't understand the comment. This check is automatically bypassed for PlanQueue (which by design has no children; see CapacityScheduler near line 562). We are testing the new version of the patch now, and will post a patch soon. Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues.
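A hedged sketch of what the merged changeCapacity with a [0, 1] range check might look like; PlanQueue, ReservationQueue, and SchedulerConfigEditException are named in the discussion above, while the setter is illustrative:
{code}
// Single entry point replacing addCapacity/subtractCapacity: set the new
// capacity directly and validate the range once, in one place.
synchronized void changeCapacity(ReservationQueue child, float newCapacity)
    throws SchedulerConfigEditException {
  if (newCapacity < 0f || newCapacity > 1f) {
    throw new SchedulerConfigEditException(
        "Capacity " + newCapacity + " for " + child.getQueueName()
        + " must be in the range [0, 1]");
  }
  child.setCapacity(newCapacity); // illustrative setter on the dynamic queue
}
{code}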
[jira] [Commented] (YARN-1990) Track time-to-allocation for different size containers
[ https://issues.apache.org/jira/browse/YARN-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111535#comment-14111535 ] Carlo Curino commented on YARN-1990: We had a simple implementation for this, using QueueMetrics and maintaining a map of delays (using SampleQuantile), tracking the start and end of the wait time by catching reserve() and unreserve() calls. In our test environment, it didn't seem to matter much. We might investigate further, and post patches after YARN-1051 is committed. Track time-to-allocation for different size containers --- Key: YARN-1990 URL: https://issues.apache.org/jira/browse/YARN-1990 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Allocation of large containers is notoriously problematic, as smaller containers can more easily grab resources. The proposal for this JIRA is to maintain a map of container sizes and time-to-allocation, that can be used as: * general insight on cluster behavior, * to inform the reservation-system, and allow us to account for delays in allocation, so that the user reservation is respected regardless of the size of containers requested.
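A hedged sketch of the reserve()/unreserve() timing idea; the size-bucket keying and the stats structure are assumptions (the comment mentions SampleQuantile, approximated here with summary statistics for brevity):
{code}
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Track time-to-allocation per container size: stamp the start on reserve()
// and record the elapsed time when the reservation is fulfilled. Keyed per
// size bucket for brevity; a real tracker would key on the reservation itself.
class AllocationLatencyTracker {
  private final Map<String, Long> pendingSince = new ConcurrentHashMap<>();
  private final Map<String, LongSummaryStatistics> latencies =
      new ConcurrentHashMap<>();

  void onReserve(String sizeBucket) {            // e.g. "8192mb-4vcores"
    pendingSince.putIfAbsent(sizeBucket, System.nanoTime());
  }

  void onAllocate(String sizeBucket) {
    Long start = pendingSince.remove(sizeBucket);
    if (start != null) {
      latencies.computeIfAbsent(sizeBucket, k -> new LongSummaryStatistics())
               .accept(System.nanoTime() - start);
    }
  }
}
{code}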
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-1707: --- Attachment: YARN-1707.4.patch
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111597#comment-14111597 ] Maysam Yabandeh commented on YARN-2405: --- I am thinking of a simple patch that catches the NPE and skips adding the record to appsTableData. Comments are highly appreciated. NPE in FairSchedulerAppsBlock (scheduler page) -- Key: YARN-2405 URL: https://issues.apache.org/jira/browse/YARN-2405 Project: Hadoop YARN Issue Type: Bug Reporter: Maysam Yabandeh FairSchedulerAppsBlock#render throws NPE at this line {code} int fairShare = fsinfo.getAppFairShare(attemptId); {code} This causes the scheduler page to not show the apps, since it lacks the definition of appsTableData {code} Uncaught ReferenceError: appsTableData is not defined {code} The problem is temporary, meaning it usually resolves itself either after a retry or after a few hours.
[jira] [Updated] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
[ https://issues.apache.org/jira/browse/YARN-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated YARN-2430: -- Assignee: Sandy Ryza (was: Maysam Yabandeh) FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Sandy Ryza The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load by a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient, since the number of compare invocations on each Comparable object is >= 1.
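A hedged sketch of both levels of caching described here: reusing the result within compare, and a periodically refreshed usage field on FSLeafQueue (the cached field, the refresh hook, and the collection name are assumptions; the comparison body is a placeholder):
{code}
// Level 1: call getResourceUsage() once per side instead of three times.
public int compare(Schedulable s1, Schedulable s2) {
  Resource usage1 = s1.getResourceUsage(); // single call, reused below
  Resource usage2 = s2.getResourceUsage();
  // ... fair-share comparison logic using usage1/usage2 ...
  return Integer.compare(usage1.getMemory(), usage2.getMemory()); // placeholder
}

// Level 2 (in FSLeafQueue): cache the sum over apps and refresh it on the
// scheduler's update cycle rather than recomputing it per comparison.
private Resource cachedUsage = Resources.none();

void refreshResourceUsage() {                 // assumed to run in update()
  Resource sum = Resources.createResource(0, 0);
  for (AppSchedulable app : runnableApps) {   // collection name assumed
    Resources.addTo(sum, app.getResourceUsage());
  }
  cachedUsage = sum;
}

@Override
public Resource getResourceUsage() {
  return cachedUsage;
}
{code}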
[jira] [Commented] (YARN-1506) Replace set resource change on RMNode/SchedulerNode directly with event notification.
[ https://issues.apache.org/jira/browse/YARN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111610#comment-14111610 ] Jian He commented on YARN-1506: --- Junping, thanks for the update. - Move the following to the previous else condition where the update is performed successfully {code} RMAuditLogger.logSuccess(user.getShortUserName(), argName, AdminService); {code} - Update the node’s capacity only if the capacity changes? And we may directly send NodeResourceUpdateSchedulerEvent here, instead of making the node send an event to itself {code} // Update node's capacity for reconnect node. rmNode.context.getDispatcher().getEventHandler().handle( new RMNodeResourceUpdateEvent(rmNode.nodeId, ResourceOption.newInstance(rmNode.totalCapability, -1))); {code} - maybe nodeAndQueueResourceUpdate -> updateNodeAndQueueResource, and similarly nodeResourceUpdate -> updateNodeResource - given that we put the common method in AbstractYarnScheduler already, we can move SchedulerUtils.updateResourceOnSchedulerNode to AbstractYarnScheduler#nodeResourceUpdate also. - FiCaSchedulerNode: import-only changes, we can revert. - testResourceOverCommit in CapacityScheduler and FifoScheduler are almost the same. I think we can create a new test file and use parameterized tests for all scheduler types. - In AdminService: we may updateNodeResource only if the node resource changes? - In FairScheduler, I think we should do the following too when updating resources, as in FairScheduler#addNode(): {code} updateRootQueueMetrics(); queueMgr.getRootQueue().setSteadyFairShare(clusterResource); queueMgr.getRootQueue().recomputeSteadyShares(); {code} Replace set resource change on RMNode/SchedulerNode directly with event notification. - Key: YARN-1506 URL: https://issues.apache.org/jira/browse/YARN-1506 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-1506-v1.patch, YARN-1506-v10.patch, YARN-1506-v11.patch, YARN-1506-v12.patch, YARN-1506-v13.patch, YARN-1506-v2.patch, YARN-1506-v3.patch, YARN-1506-v4.patch, YARN-1506-v5.patch, YARN-1506-v6.patch, YARN-1506-v7.patch, YARN-1506-v8.patch, YARN-1506-v9.patch According to Vinod's comments on YARN-312 (https://issues.apache.org/jira/browse/YARN-312?focusedCommentId=13846087&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13846087), we should replace RMNode.setResourceOption() with some resource change event.
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111624#comment-14111624 ] Jian He commented on YARN-2456: --- One thing we can do is add applications to the scheduler in application submission order, i.e. sort the apps by applicationId before recovering them.
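A hedged sketch of that ordering idea; ApplicationId is genuinely Comparable and ids grow with submission order, but the recoveredApps map and the recoverApplication helper are hypothetical stand-ins for the RM's recovery path:
{code}
// Recover apps in submission order so App1 is re-added (and activated)
// before App2, avoiding the inversion described in the scenario above.
List<ApplicationId> appIds = new ArrayList<ApplicationId>(recoveredApps.keySet()); // hypothetical map
Collections.sort(appIds); // ApplicationId implements Comparable<ApplicationId>
for (ApplicationId appId : appIds) {
  recoverApplication(recoveredApps.get(appId)); // hypothetical helper
}
{code}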
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111710#comment-14111710 ] Tsuyoshi OZAWA commented on YARN-2405: -- Hi [~maysamyabandeh], could you tell us the version in which you faced this problem?
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111713#comment-14111713 ] Anubhav Dhoot commented on YARN-1372: - bq. I meant, is it possible for an NM in the DECOMMISSIONED/LOST state to receive the newly added CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle them too. Fixed that. bq. the same justFinishedContainers set can be used to return to AM and ack NMs? There are 3 states for completed containers in this set: a) container added to justFinishedContainers but not yet sent to the AM; b) container sent to the AM in a previous allocateResponse but not yet acked; c) the next allocate call from the AM has happened after the container was sent, which implicitly acks it from the AM's point of view, so it can now be sent to the NM. Instead of keeping additional state to track a) and b), I used 2 collections, justFinishedContainers and previousJustFinishedContainers respectively. I have added tests to show that. bq. I meant can we remove all the containers in NMContext for the application once we received the NodeHeartbeatResponse#getApplicationsToCleanup notification, instead of depending on expiration. I tried doing that but hit one issue: ApplicationImpl, which has the mapping of application to containers, cannot access the event dispatcher for ContainerManagerImpl (which is the one removing the containers from the context). I am going to upload a patch that removes the dispatcher local to ContainerManagerImpl (~/patches/YARN-1372.002_NMHandlesCompletedApp.patch). I also looked into an alternate approach where the RM acks the completed containers that belong to an app that has completed. I am uploading that patch as well (~/patches/YARN-1372.002_RMHandlesCompletedApp.patch)
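A hedged sketch of the two-collection scheme described above: each allocate call acks the previous batch to the NMs and ships the current batch to the AM (the collection names follow the comment; the NM-ack helper is hypothetical):
{code}
// On each AM allocate(): containers sent in the previous response are now
// implicitly acked (state c), so notify the NMs and rotate the collections.
private List<ContainerStatus> justFinishedContainers = new ArrayList<>();
private List<ContainerStatus> previousJustFinishedContainers = new ArrayList<>();

synchronized List<ContainerStatus> pullJustFinishedContainers() {
  // State c: the previous batch was delivered and is acked by this call.
  ackCompletedContainersToNMs(previousJustFinishedContainers); // hypothetical
  // State a -> b: the current batch goes out with this response.
  previousJustFinishedContainers = justFinishedContainers;
  justFinishedContainers = new ArrayList<>();
  return new ArrayList<>(previousJustFinishedContainers);
}
{code}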
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.002_RMHandlesCompletedApp.patch Addresses feedback by having the RM ack completed containers for a completed app.
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.002_NMHandlesCompletedApp.patch Addresses feedback by having the NM remove containers for a completed app from its context.
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111718#comment-14111718 ] Hadoop QA commented on YARN-1372: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664541/YARN-1372.002_RMHandlesCompletedApp.patch against trunk revision . {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4738//console This message is automatically generated.
[jira] [Created] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues
Karthik Kambatla created YARN-2457: -- Summary: FairScheduler: Handle preemption to help starved parent queues Key: YARN-2457 URL: https://issues.apache.org/jira/browse/YARN-2457 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't check for parent queue starvation. We need to check that.
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111722#comment-14111722 ] Karthik Kambatla commented on YARN-2395: I see the issue now. Thanks for catching it, Ashwin. Wei and I discussed this offline to see what might be the best way to handle it, and here is what I think might work: # Starved leaf queues will continue to be handled the way they are in the latest patch. # YARN-2154 changes the behavior for leaf queues to look at actual ResourceRequests and preempt only matching containers. # YARN-2457: For each starved parent queue, pick the application with positive demand that is least disadvantaged in terms of allocation, and preempt containers that match it. I propose we get this in and follow up on the other items in their respective JIRAs.
[jira] [Commented] (YARN-2154) FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111724#comment-14111724 ] Karthik Kambatla commented on YARN-2154: Started looking into this. Today, we just look at the amount of resources to be preempted. Instead, we should collect a list of applications for which we are preempting containers, then iterate through these applications and their ResourceRequests to find potential matches, first in free resources and subsequently in resources assigned to other applications that are over their fair share. Will post a patch for this once YARN-2395 and YARN-2394 get committed. FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request -- Key: YARN-2154 URL: https://issues.apache.org/jira/browse/YARN-2154 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Today, FairScheduler uses a spray-gun approach to preemption. Instead, it should only preempt resources that would satisfy the incoming request.
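A hedged sketch of request-matched preemption along those lines: collect the starved apps, then free only containers that could actually satisfy one of their outstanding requests (the helper methods are illustrative; Resources.fitsIn is a real utility):
{code}
// Preempt only containers whose size can satisfy a pending ResourceRequest
// of a starved application, instead of the spray-gun approach.
void preemptForStarvedApps(List<FSAppAttempt> starvedApps) {
  for (FSAppAttempt starved : starvedApps) {
    for (ResourceRequest request : starved.getPendingRequests()) {  // illustrative
      for (RMContainer candidate : containersOverFairShare()) {     // illustrative
        if (Resources.fitsIn(request.getCapability(),
                             candidate.getContainer().getResource())) {
          preemptContainer(candidate);                              // illustrative
          break; // one matching container per request in this sketch
        }
      }
    }
  }
}
{code}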
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111725#comment-14111725 ] Maysam Yabandeh commented on YARN-2405: --- [~ozawa], we got the error in a fork of 2.0.5, but further code inspection showed that the problem also exists in 2.5.
[jira] [Commented] (YARN-2385) Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111737#comment-14111737 ] Subramaniam Krishnan commented on YARN-2385: [~sunilg], the behavior of *getAppsInQueue* should be the same for both CS and FS, unless I am missing something. As part of YARN-2378, I added pending apps also to CS#getAppsInQueue. Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler Currently getAppsinQueue returns both pending and running apps. The purpose of the JIRA is to explore splitting it into getRunningAppsInQueue + getPendingAppsInQueue, which would provide more flexibility to callers.
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111740#comment-14111740 ] hex108 commented on YARN-2405: -- If an RMApp has not been accepted by the scheduler, it will only be recorded in `Map<ApplicationId, RMApp> rmContext.getRMApps()`. So I think we could first test whether it is in `Map<ApplicationId, SchedulerApplication> applications`, and then decide whether to get its fair share. Is that OK?
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111747#comment-14111747 ] Karthik Kambatla commented on YARN-2395: Latest patch looks good to me. +1 [~ashwinshankar77] - does the plan and the current patch look alright to you?
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111757#comment-14111757 ] Tsuyoshi OZAWA commented on YARN-2405: -- [~maysamyabandeh], [~hex108], I see the problem now. I think the trunk code still has the issue. Let me tackle this.
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111758#comment-14111758 ] Maysam Yabandeh commented on YARN-2405: --- Sounds good to me. We also need to decide how to react to a nonexistent app: return a fair share of 0, -1, or skip the whole record from appsTableData? If the problematic record is going to be skipped, then instead of putting checks inside the fair share computation, we can alternatively catch the NPE at FairSchedulerAppsBlock.
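A hedged sketch of the catch-and-skip option at the render site; the loop shape inside FairSchedulerAppsBlock#render is simplified, and the row-appending helper is illustrative (appsTableData is the JS array named in the description):
{code}
// Skip apps the scheduler no longer knows about instead of letting the NPE
// abort render(), which leaves appsTableData undefined on the page.
for (RMApp app : apps) {
  ApplicationAttemptId attemptId =
      app.getCurrentAppAttempt().getAppAttemptId();
  int fairShare;
  try {
    fairShare = fsinfo.getAppFairShare(attemptId);
  } catch (NullPointerException e) {
    // App not yet (or no longer) tracked by the scheduler: drop this row.
    continue;
  }
  appendAppRow(appsTableData, app, fairShare); // illustrative helper
}
{code}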
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111770#comment-14111770 ] Maysam Yabandeh commented on YARN-2405: --- [~ozawa], all right then. Looking forward to your patch.
[jira] [Assigned] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA reassigned YARN-2405: Assignee: Tsuyoshi OZAWA
[jira] [Assigned] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined incorrectly.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang reassigned YARN-2454: - Assignee: Xu Yang The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined incorrectly. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Assignee: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implements the abstract class Resources and overrides the function compareTo. But there is something wrong in this function: UNBOUNDED should not compare resources against zero the same way the variable NONE does. We should change the 0 to Integer.MAX_VALUE.
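A hedged sketch of the proposed fix, following the pattern of the NONE constant in the same class (the surrounding boilerplate of the anonymous Resource subclass is abbreviated):
{code}
// UNBOUNDED should sort above every real resource, so compare using
// Integer.MAX_VALUE rather than 0 (which made it behave like NONE).
private static final Resource UNBOUNDED = new Resource() {
  @Override public int getMemory() { return Integer.MAX_VALUE; }
  @Override public int getVirtualCores() { return Integer.MAX_VALUE; }
  @Override public void setMemory(int memory) {
    throw new RuntimeException("UNBOUNDED cannot be modified!");
  }
  @Override public void setVirtualCores(int cores) {
    throw new RuntimeException("UNBOUNDED cannot be modified!");
  }
  @Override
  public int compareTo(Resource o) {
    int diff = Integer.MAX_VALUE - o.getMemory();     // was: 0 - o.getMemory()
    if (diff == 0) {
      diff = Integer.MAX_VALUE - o.getVirtualCores(); // was: 0 - o.getVirtualCores()
    }
    return diff;
  }
};
{code}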