[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110340#comment-14110340 ] Tsuyoshi OZAWA commented on YARN-2452: -- Thanks for your contribution, [~zxu]. This is just my guess, but I think some tests depend on CapacityScheduler. Should we fix all of them? About the patch itself, it's better to use FairSchedulerConfiguration.ASSIGN_MULTIPLE instead of hard-coding the property. {code} +conf.setBoolean("yarn.scheduler.fair.assignmultiple", true); {code} TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
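For illustration, the suggested change amounts to something like the following in the test setup. This is a sketch only; it assumes FairSchedulerConfiguration.ASSIGN_MULTIPLE is visible to the test, which is exactly what the next comments discuss:
{code}
// Hypothetical test setup: use the named constant instead of the literal string.
Configuration conf = new YarnConfiguration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    FairScheduler.class, ResourceScheduler.class);
conf.setBoolean(FairSchedulerConfiguration.ASSIGN_MULTIPLE, true);
{code}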
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110354#comment-14110354 ] Beckham007 commented on YARN-810: - Hi, [~ywskycn] and [~vvasudev]. Both this issue and YARN-2440 are doing CPU core isolation for containers. In our production cluster, if the number of vcores is more than pcores, the NM will crash (the system processes couldn't get CPU time). So these issues are worthwhile. But using cfs_quota_us and cfs_period_us makes too many changes in the LCE, even though we have modified ContainerLaunch. I think cpu/memory/diskio could be first-class for resource isolation, and cfs_quota_us and cfs_period_us should be second. I also think we should refactor the LCE to support more cgroups subsystems, as in YARN-2139 and YARN-2140. In that case, we could use cpuset for CPU core isolation. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN.
First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 100000 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1
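To make the proposed enforcement concrete, here is a rough sketch of how an NM could derive a hard cap from a container's vcore allocation. The variable names and the 1:4 pcore:vcore ratio are assumptions for illustration, not the attached patch:
{code}
// Illustrative only: derive a CFS hard cap for one container.
// The kernel's default period is 100000us (0.1s); quotas below 1000us are rejected.
long periodUs = 100000L;
int containerVcores = 1;  // the container's requested vcores (assumed)
int vcoresPerPcore = 4;   // the configured pcore:vcore ratio (assumed)
long quotaUs = Math.max(1000L, periodUs * containerVcores / vcoresPerPcore);
// Writing quotaUs to the container cgroup's cpu.cfs_quota_us enforces the ceiling;
// writing -1 restores the default soft-share behavior shown above.
{code}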
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110366#comment-14110366 ] zhihai xu commented on YARN-2452: - [~Tsuyoshi OZAWA] thanks for the review. I tried to use FairSchedulerConfiguration.ASSIGN_MULTIPLE at the beginning, but I got a compilation error because ASSIGN_MULTIPLE is protected and can't be accessed by the test. {code} protected static final String ASSIGN_MULTIPLE = CONF_PREFIX + "assignmultiple"; {code} Can I change protected to public in the above code? TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
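For reference, the change being asked about would be a one-line visibility tweak in FairSchedulerConfiguration (assuming no other constraints on the constant's visibility):
{code}
public static final String ASSIGN_MULTIPLE = CONF_PREFIX + "assignmultiple";
{code}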
[jira] [Created] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
Xu Yang created YARN-2454: - Summary: The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110378#comment-14110378 ] Tsuyoshi OZAWA commented on YARN-2452: -- Thanks for your explanation. I don't know why this property is protected. [~kkambatl], [~sandyr], can we make FairSchedulerConfiguration.ASSIGN_MULTIPLE public? Or shouldn't we do that? TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2453: Attachment: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler: {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
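Presumably the fix is to pin the scheduler class in the test configuration so the SchedulingMonitor service is actually created. A minimal sketch of that idea (not necessarily the attached patch):
{code}
// Only PreemptableResourceScheduler implementations (e.g. CapacityScheduler)
// cause the RM to create the SchedulingMonitor service.
Configuration conf = new Configuration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
conf.setBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{code}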
[jira] [Assigned] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA reassigned YARN-2454: Assignee: Tsuyoshi OZAWA The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang Assignee: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110388#comment-14110388 ] Beckham007 commented on YARN-2454: -- The compareTo() of Resource UNBOUNDED is copied from Resource NONE. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang Assignee: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2455) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong.
Xu Yang created YARN-2455: - Summary: The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong. Key: YARN-2455 URL: https://issues.apache.org/jira/browse/YARN-2455 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
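Concretely, the description implies a fix along these lines inside the UNBOUNDED singleton (a sketch of the idea, not the attached patch):
{code}
// UNBOUNDED should compare using its own maximal values rather than NONE's zeros.
@Override
public int compareTo(Resource o) {
  int diff = Integer.MAX_VALUE - o.getMemory();      // was: 0 - o.getMemory()
  if (diff == 0) {
    diff = Integer.MAX_VALUE - o.getVirtualCores();  // was: 0 - o.getVirtualCores()
  }
  return diff;
}
{code}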
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2454: - Labels: (was: patch) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Assignee: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2454: - Attachment: YARN-2454-patch.diff The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Assignee: Tsuyoshi OZAWA Attachments: YARN-2454-patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2454: - Assignee: (was: Tsuyoshi OZAWA) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Description: The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110402#comment-14110402 ] Tsuyoshi OZAWA commented on YARN-2454: -- Thanks for your contribution, [~beckham007]. The fix itself looks good to me. How about adding tests to TestResources like this? {code}
@Test(timeout=1000)
public void testCompareToWithUnboundedResource() {
  assertTrue(Resources.unbounded().compareTo(
      createResource(Integer.MAX_VALUE, Integer.MAX_VALUE)) == 0);
  assertTrue(Resources.unbounded().compareTo(
      createResource(Integer.MAX_VALUE, 0)) < 0);
  assertTrue(Resources.unbounded().compareTo(
      createResource(0, Integer.MAX_VALUE)) < 0);
}

@Test(timeout=1000)
public void testCompareToWithNoneResource() {
  assertTrue(Resources.none().compareTo(createResource(0, 0)) == 0);
  assertTrue(Resources.none().compareTo(
      createResource(1, 0)) > 0);
  assertTrue(Resources.none().compareTo(
      createResource(0, 1)) > 0);
}
{code} The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110404#comment-14110404 ] Tsuyoshi OZAWA commented on YARN-2454: -- sorry, testCompareToWithNoneResource is wrong. A fixed version is as follows: {code}
@Test(timeout=1000)
public void testCompareToWithNoneResource() {
  assertTrue(Resources.none().compareTo(createResource(0, 0)) == 0);
  assertTrue(Resources.none().compareTo(
      createResource(1, 0)) < 0);
  assertTrue(Resources.none().compareTo(
      createResource(0, 1)) < 0);
}
{code} The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110410#comment-14110410 ] Beckham007 commented on YARN-2454: -- Hi, [~ozawa]. It could be assertTrue(Resources.unbounded().compareTo( createResource(Integer.MAX_VALUE, 0)) > 0) ? The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2455) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong.
[ https://issues.apache.org/jira/browse/YARN-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang resolved YARN-2455. --- Resolution: Duplicate Looks like the same issue as YARN-2454. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined wrong. Key: YARN-2455 URL: https://issues.apache.org/jira/browse/YARN-2455 Project: Hadoop YARN Issue Type: Bug Reporter: Xu Yang The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110413#comment-14110413 ] Tsuyoshi OZAWA commented on YARN-2454: -- Oops, you're right. Could you update it? The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110414#comment-14110414 ] Wei Yan commented on YARN-2454: --- One more thing: the NONE in these messages also needs to be updated to UNBOUNDED. {code}
@Override
public void setMemory(int memory) {
  throw new RuntimeException("NONE cannot be modified!");
}

@Override
public void setVirtualCores(int cores) {
  throw new RuntimeException("NONE cannot be modified!");
}
{code} The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
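In other words, the UNBOUNDED setters were presumably copy-pasted from NONE, so a corrected version would read (sketch, not the committed patch):
{code}
@Override
public void setMemory(int memory) {
  throw new RuntimeException("UNBOUNDED cannot be modified!");
}

@Override
public void setVirtualCores(int cores) {
  throw new RuntimeException("UNBOUNDED cannot be modified!");
}
{code}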
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110423#comment-14110423 ] Beckham007 commented on YARN-2454: -- I have talked with [~yxls123123], he will update this patch. Thanks~ The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2453) TestProportionalCapacityPreemptionPolicy is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110427#comment-14110427 ] Hadoop QA commented on YARN-2453: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664328/YARN-2453.000.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4731//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4731//console This message is automatically generated. TestProportionalCapacityPreemptionPolicy is failed for FairScheduler Key: YARN-2453 URL: https://issues.apache.org/jira/browse/YARN-2453 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2453.000.patch TestProportionalCapacityPreemptionPolicy is failed for FairScheduler. The following is the error message: Running org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy Tests run: 18, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.94 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy testPolicyInitializeAfterSchedulerInitialized(org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy) Time elapsed: 1.61 sec FAILURE! java.lang.AssertionError: Failed to find SchedulingMonitor service, please check what happened at org.junit.Assert.fail(Assert.java:88) at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy.testPolicyInitializeAfterSchedulerInitialized(TestProportionalCapacityPreemptionPolicy.java:469) This test should only work for the capacity scheduler, because the following source code in ResourceManager.java proves it will only work for the capacity scheduler: {code} if (scheduler instanceof PreemptableResourceScheduler && conf.getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS)) { {code} This is because CapacityScheduler is an instance of PreemptableResourceScheduler and FairScheduler is not. I will upload a patch to fix this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110426#comment-14110426 ] Hadoop QA commented on YARN-2454: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664331/YARN-2454-patch.diff against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4732//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4732//console This message is automatically generated. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Attachment: YARN-2454.patch Thank you for your suggestions, Beckham007, Tsuyoshi OZAWA and Wei Yan. I fixed it and added two tests. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110472#comment-14110472 ] Hadoop QA commented on YARN-2454: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664350/YARN-2454.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4733//console This message is automatically generated. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110483#comment-14110483 ] Tsuyoshi OZAWA commented on YARN-2454: -- [~yxls123123], please generate your patch at the root directory of the source code. Additional minor nits: how about moving the tests to org.apache.hadoop.yarn.server.resource.TestResources instead of adding a new file? The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
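For reference, one common way to produce such a patch from the top of the tree (assuming a git checkout; exact commands vary by setup):
{noformat}
$ cd hadoop        # repository root (path assumed)
$ git diff > YARN-2454-v2.patch
{noformat}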
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110525#comment-14110525 ] Xu Yang commented on YARN-2454: --- [~te...@uproadx.com], thank you for your suggestion. I generated a new patch at the root directory. About moving these tests to org.apache.hadoop.yarn.server.resourcemanager.resource.Resources, I think that would be strange; moving the latter into the file I created seems a better way. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Attachment: YARN-2454 -v2.patch The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang updated YARN-2454: -- Attachment: (was: YARN-2454 .patch) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110555#comment-14110555 ] Varun Vasudev commented on YARN-2440: - [~jlowe] the example provided by [~sjlee0] is the one I wanted to address when I added support for both percentage and absolute cores. Would it make more sense if I picked the lower value instead of one overriding the other? Something like: 1. Evaluate the cores allowed by yarn.nodemanager.containers-cpu-cores and yarn.nodemanager.containers-cpu-percentage. 2. Pick the lower of the two values. 3. Log a warning/info message that both were specified and that we're picking the lower value. {quote} I'm not thrilled about the name template containers-cpu-* since it could easily be misinterpreted as a per-container thing as well, but I'm currently at a loss for a better prefix. Suggestions welcome. {quote} How about yarn.nodemanager.all-containers-cpu-cores and yarn.nodemanager.all-containers-cpu-percentage? {quote} Does getOverallLimits need to check for a quotaUS that's too low as well? {quote} Thanks for catching this; I'll fix it in the next patch. {quote} I think minimally we need to log a warning if we're going to ignore setting up cgroups to limit CPU usage across all containers if the user specified to do so. {quote} I'll add in the logging message. {quote} Related to the previous comment, I think it would be nice if we didn't try to setup any limits if none were specified. That way if there's some issue with correctly determining the number of cores on a particular system it can still work in the default, use everything scenario. {quote} Will do. {quote} NodeManagerHardwareUtils.getContainerCores should be getContainersCores (the per-container vs. all-containers confusion again) {quote} I'll rename the function. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
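Sketched out, the reconciliation proposed in steps 1-3 would look roughly like this (the property names are the ones under discussion and may still change; conf is the NM Configuration, LOG a standard logger, and nodeCores the detected hardware core count):
{code}
// Take the more restrictive of the two limits and warn when they disagree.
int coresFromCount = conf.getInt("yarn.nodemanager.containers-cpu-cores", nodeCores);
int coresFromPercent = (int) (nodeCores
    * conf.getInt("yarn.nodemanager.containers-cpu-percentage", 100) / 100.0f);
int containersCores = Math.min(coresFromCount, coresFromPercent);
if (coresFromCount != coresFromPercent) {
  LOG.warn("Both CPU limits were specified; using the lower value: " + containersCores);
}
{code}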
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110556#comment-14110556 ] Varun Vasudev commented on YARN-2440: - [~beckham007] the current implementation of Cgroups uses cpu instead of cpuset, probably due to the flexibility offered (sharing the cores is handled by the kernel). Is there any particular benefit to cpuset? Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110566#comment-14110566 ] Beckham007 commented on YARN-2440: -- Hi, [~vvasudev]. “Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks.” https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt For an NM that has 24 pcores, we can use the cpuset subsystem to make hadoop-yarn use CPU cores 0-21 and leave the others (22, 23) for the system. Then cpu.shares can be used to share pcores 0-21. What's more, we can assign a pcore (such as core 21) to run a long-running container, while other containers only share pcores 0-20. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
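In cgroup terms, the layout described above would look something like this (illustrative only; paths assume a /cgroup mount, and cpuset.mems must also be set before tasks can be added to a cpuset):
{noformat}
echo 0-21 > /cgroup/cpuset/hadoop-yarn/cpuset.cpus    # containers share cores 0-21
echo 22-23 > /cgroup/cpuset/system/cpuset.cpus        # cores reserved for system daemons
{noformat}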
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110567#comment-14110567 ] Beckham007 commented on YARN-2440: -- In addition, Mesos uses cpuset by default. https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110571#comment-14110571 ] Hadoop QA commented on YARN-2454: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664364/YARN-2454%20-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4734//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4734//console This message is automatically generated. The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is definited wrong. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implement the abstract class Resources, and override the function compareTo. But there is something wrong in this function. We should not compare resources with zero as the same as the variable NONE. We should change 0 to Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110815#comment-14110815 ] Eric Payne commented on YARN-415: - bq. -1 release audit. The applied patch generated 3 release audit warnings. Files triggering audit warnings not part of this patch: {{EncryptionFaultInjector.java}}, {{EncryptionZoneManager.java }}, and {{EncryptionZoneWithId.java}} {quote} -1 core tests. The patch failed these unit tests org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {quote} This test failure is intermittent and does not seem to be caused by this patch. Please see: https://builds.apache.org/job/PreCommit-YARN-Build/4711/ https://builds.apache.org/job/PreCommit-YARN-Build/4727/ [~jianhe] and [~kkambatl], I really appreciate all of your help in reviewing this patch and making it better with your suggestions. How close are we to getting this patch submitted? Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
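As a quick worked example of the chargeback formula in the description (the numbers are invented for illustration):
{code}
// container 1: 2048 MB reserved for 600 s; container 2: 1024 MB reserved for 300 s
long memorySeconds = 2048L * 600 + 1024L * 300;  // = 1536000 MB-seconds for the app
{code}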
[jira] [Commented] (YARN-2385) Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110849#comment-14110849 ] Sunil G commented on YARN-2385: --- Yes [~subru], _moveAllApps_ is also using this api. bq. If not maybe we should defer the splitting till we have a concrete use case? Now the behavior of *getAppsInQueue* in _killAllAppsInQueue_, _getApplications_ and _moveAllApps_ is different with Capacity Scheduler and Fair Scheduler. Hence I feel that we can split up and make a uniform behavior in all these caller sides. How do you feel? Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler Currently getAppsinQueue returns both pending running apps. The purpose of the JIRA is to explore splitting it to getRunningAppsInQueue + getPendingAppsInQueue that will provide more flexibility to callers -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110899#comment-14110899 ] Zhijie Shen commented on YARN-2102: --- Remove the last patch, as the method doesn't work given the group in the reader/writer list. More generalized timeline ACLs -- Key: YARN-2102 URL: https://issues.apache.org/jira/browse/YARN-2102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: GeneralizedTimelineACLs.pdf, YARN-2102.1.patch, YARN-2102.2.patch, YARN-2102.3.patch We need to differentiate the access controls of reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs, which writing the timeline data of this workflow, we don't want other user to pollute the timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Attachment: (was: YARN-2102.4.patch) More generalized timeline ACLs -- Key: YARN-2102 URL: https://issues.apache.org/jira/browse/YARN-2102 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: GeneralizedTimelineACLs.pdf, YARN-2102.1.patch, YARN-2102.2.patch, YARN-2102.3.patch We need to differentiate the access controls of reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs, which writing the timeline data of this workflow, we don't want other user to pollute the timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2452: Attachment: YARN-2452.001.patch TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch, YARN-2452.001.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110968#comment-14110968 ] zhihai xu commented on YARN-2452: - I uploaded a new patch YARN-2452.001.patch. It splits testRMWritingMassiveHistory into two tests: testRMWritingMassiveHistoryForFairSche and testRMWritingMassiveHistoryForCapacitySche, one for the fair scheduler and one for the capacity scheduler. So we can test both schedulers. TestRMApplicationHistoryWriter is failed for FairScheduler -- Key: YARN-2452 URL: https://issues.apache.org/jira/browse/YARN-2452 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2452.000.patch, YARN-2452.001.patch TestRMApplicationHistoryWriter is failed for FairScheduler. The failure is the following: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 69.311 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter testRMWritingMassiveHistory(org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter) Time elapsed: 66.261 sec FAILURE! java.lang.AssertionError: expected:<1> but was:<200> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:430) at org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter.testRMWritingMassiveHistory(TestRMApplicationHistoryWriter.java:391) -- This message was sent by Atlassian JIRA (v6.2#6252)
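The described split would look roughly like this (the test method names are from the comment above; the shared body is assumed to accept a scheduler-specific configuration):
{code}
@Test
public void testRMWritingMassiveHistoryForFairSche() throws Exception {
  Configuration conf = new YarnConfiguration();
  conf.setClass(YarnConfiguration.RM_SCHEDULER,
      FairScheduler.class, ResourceScheduler.class);
  // assignmultiple lets FairScheduler hand out several containers per heartbeat
  conf.setBoolean("yarn.scheduler.fair.assignmultiple", true);
  runMassiveHistoryTest(conf);  // assumed shared helper
}

@Test
public void testRMWritingMassiveHistoryForCapacitySche() throws Exception {
  Configuration conf = new YarnConfiguration();
  conf.setClass(YarnConfiguration.RM_SCHEDULER,
      CapacityScheduler.class, ResourceScheduler.class);
  runMassiveHistoryTest(conf);  // assumed shared helper
}
{code}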
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111015#comment-14111015 ] Junping Du commented on YARN-2033: -- [~zjshen], overall the patch looks good to me. I don't have many more comments, except the one below. Would you fix it? bq. First of all, the two queries are not duplicates: one reads the application entity, and the other reads the app attempt entity, and we previously distinguished ApplicationNotFoundException and ApplicationAttemptNotFoundException. It is always possible that App1 exists in the store with only attempt AppAttempt1 while the user looks up AppAttempt2. In this case, we know App1 is there but AppAttempt2 isn't, so we will throw ApplicationAttemptNotFoundException. If we really want to differentiate the two exceptions, we can still check the ApplicationAttempt first and check the Application afterwards (to see whether to throw ApplicationNotFoundException instead) if ApplicationAttemptNotFoundException gets thrown there. This is more efficient, as we only need to visit the DB once in most cases. Isn't that right? Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client-side interfaces as close to what we have today.
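A hedged sketch of the attempt-first lookup being suggested; ApplicationNotFoundException and ApplicationAttemptNotFoundException are real YARN exceptions, while the store accessors getApplicationAttemptEntity and getApplicationEntity are hypothetical stand-ins for the timeline-store reads discussed above:
{code}
// Check the app attempt first; only on a miss do a second read to decide
// which exception to surface. One store visit in the common case.
ApplicationAttemptReport getAppAttempt(ApplicationAttemptId attemptId)
    throws YarnException {
  ApplicationAttemptReport attempt = getApplicationAttemptEntity(attemptId); // hypothetical
  if (attempt != null) {
    return attempt; // common case: a single store read
  }
  ApplicationReport app = getApplicationEntity(attemptId.getApplicationId()); // hypothetical
  if (app == null) {
    throw new ApplicationNotFoundException(
        "Application " + attemptId.getApplicationId() + " not found");
  }
  throw new ApplicationAttemptNotFoundException(
      "Application attempt " + attemptId + " not found");
}
{code}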
[jira] [Commented] (YARN-2452) TestRMApplicationHistoryWriter is failed for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111073#comment-14111073 ] Hadoop QA commented on YARN-2452: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664426/YARN-2452.001.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4735//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4735//console This message is automatically generated.
[jira] [Updated] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2395: -- Attachment: YARN-2395-3.patch FairScheduler: Preemption timeout should be configurable per queue -- Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch, YARN-2395-2.patch, YARN-2395-3.patch, YARN-2395-3.patch Currently in the fair scheduler, the preemption logic considers fair share starvation only at the leaf queue level. This jira is created to implement it at the parent queue as well. It involves: 1. Making the check for fair share starvation and the amount of resource to preempt recursive, so that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues.
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111090#comment-14111090 ] Karthik Kambatla commented on YARN-2395: bq. Which means starvation at parent queues would not be detected and preemption at parent will not happen. Am I missing something? If a parent queue is starved, wouldn't at least one of the child queues starve? Is there a counterexample?
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1431#comment-1431 ] Karthik Kambatla commented on YARN-2395: I think I now get Ashwin's point. Ashwin - please correct me if I am wrong. If the parent queue has a timeout of 5 seconds and all the child queues have a timeout of 30 seconds, any preemption under the parent queue kicks in only after 30 seconds and not 5 seconds. I am not sure we can really do much in this case. It may be a case of misconfiguration that we want to warn about. But then again, if a leaf queue were later created under the parent, it would inherit these timeouts.
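A hedged sketch of the inheritance rule under discussion, assuming each queue keeps a per-queue fairSharePreemptionTimeout field that is -1 when unset; the field, accessor, and default names are illustrative:
{code}
// Effective timeout: a queue uses its own value when configured, otherwise
// it inherits from the nearest configured ancestor, else the global default.
long getEffectiveFairSharePreemptionTimeout(FSQueue queue) {
  long timeout = queue.getFairSharePreemptionTimeout(); // assumed: -1 when unset
  FSQueue parent = queue.getParent();
  while (timeout < 0 && parent != null) {
    timeout = parent.getFairSharePreemptionTimeout();
    parent = parent.getParent();
  }
  return timeout < 0 ? defaultFairSharePreemptionTimeout : timeout; // assumed default
}
{code}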
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1448#comment-1448 ] Ashwin Shankar commented on YARN-2395: -- [~kasha], bq. If a parent queue is starved, wouldn't at least one of the child queues starve? Not always. Here is an example. Queue hierarchy: root.lowPriorityLeaf - fair share = 10%; root.HighPriorityParent - fair share = 90%, fairSharePreemptionThreshold = 1; root.HighPriorityParent.child(1-10). Scenario: apps running in root.lowPriorityLeaf, root.HighPriorityParent.child1, root.HighPriorityParent.child2 (remember we now have fair share for active queues). The following situation is possible: root.lowPriorityLeaf: *usage = 55% demand = 55% fair share = 10%* root.HighPriorityParent.child1: *usage = 45% demand = 85% fair share = 45%* root.HighPriorityParent.child2: usage = 5% demand = 5% fair share = 45% In the above example, the low priority queue with fair share 10% is taking up 55% of the cluster, while HighPriorityParent.child1 needs 85% but can get only 45% through preemption, since that's its fair share. Another point is that HighPriorityParent.child2 has a fair share of 45%, but needs only 5%. *Note that both child1 and child2 are NOT starved, but HighPriorityParent is starved.* The use case is basically this: we want ALL 90% of the cluster resources to go to HighPriorityParent whenever it's needed by ANY of its children. We can do that by detecting starvation at the parent HighPriorityParent and preempting from lowPriorityLeaf.
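A hedged sketch of the recursive starvation check this example motivates; getResourceUsage and getFairShare come from FairScheduler's Schedulable interface, while the threshold accessor and the memory-only comparison are simplifying assumptions:
{code}
// A parent is starved when its aggregate usage is below its fair share
// (scaled by the preemption threshold), even if no single child is starved.
// Compares memory only, for brevity.
boolean isStarved(FSQueue queue) {
  float threshold = queue.getFairSharePreemptionThreshold(); // assumed accessor
  Resource usage = queue.getResourceUsage();
  Resource share = queue.getFairShare();
  return usage.getMemory() < share.getMemory() * threshold;
}

// Walk the hierarchy root-to-leaf and collect every starved queue, so the
// whole 90% can flow to HighPriorityParent in the example above.
void collectStarvedQueues(FSQueue queue, List<FSQueue> starved) {
  if (isStarved(queue)) {
    starved.add(queue);
  }
  for (FSQueue child : queue.getChildQueues()) {
    collectStarvedQueues(child, starved);
  }
}
{code}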
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111235#comment-14111235 ] Jian He commented on YARN-1372: --- [~adhoot], any thoughts on my last comments? bq. the same justFinishedContainers set can be used to return to AM and ack NMs? bq. I meant can we remove all the containers in NMContext for the application once we received the NodeHeartbeatResponse#getApplicationsToCleanup notification, instead of depending on expiration. bq. I meant, is it possible for an NM in the DECOMMISSIONED/LOST state to receive the newly added CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle them too. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, the NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM, the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once, since the AM may have pulled the container but the RM may die before notifying the NM about the pull.
[jira] [Created] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
Jian He created YARN-2456: - Summary: Possible deadlock in CapacityScheduler when RM is recovering apps Key: YARN-2456 URL: https://issues.apache.org/jira/browse/YARN-2456 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Consider this scenario: 1. RM is configured with a single queue and only one application can be active at a time. 2. Submit App1, which uses up the queue's whole capacity. 3. Submit App2, which remains pending. 4. Restart the RM. 5. App2 is recovered before App1, so App2 is added to the activeApplications list. Now App1 remains pending (because of the max-active-app limit). 6. All containers of App1 are then recovered when the NM registers, and they use up the whole queue capacity again. 7. Since the queue is full, App2 cannot proceed to allocate its AM container. 8. Meanwhile, App1 cannot proceed to become active because of the max-active-app limit.
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111471#comment-14111471 ] Jason Lowe commented on YARN-2440: -- For the case presented by [~sjlee0], the user has an 8 core system and wants to use at most 6 cores for YARN containers. That can be done by simply setting containers-cpu-percentage to 75. I don't see why we need a separate containers-cpu-cores parameter here, and I think it causes more problems than it solves, per my previous comment. If we only want to support whole-core granularity then I can see containers-cpu-cores as a better choice, but otherwise containers-cpu-percentage is more flexible. Also I don't see vcores being relevant for this JIRA. The way vcores map to physical cores is node-dependent, but apps ask for vcores in a node-independent fashion. IIUC this JIRA is focused on simply limiting the amount of CPU all YARN containers on the node can possibly use in aggregate. Changing the vcore-to-core ratio on the node will change how many containers the node might run simultaneously, but it shouldn't impact how much of the physical CPU the user wants reserved for non-container processes. On a related note, it's interesting to step back and see if this is really what most users will want in practice. If the intent is to ensure the NM, DN, and other system processes get enough CPU time, then I think a better approach is to put those system processes in a peer cgroup to the YARN containers cgroup and set their relative CPU shares accordingly. Then YARN containers can continue to use any spare CPU if desired (i.e.: no CPU fragmentation), but the system processes are guaranteed not to be starved out by the YARN containers. Some users may want a hard limit, which is why this feature would be useful for them, but I suspect most users will not want to leave spare CPU lying around when containers need it. bq. How about yarn.nodemanager.all-containers-cpu-cores and yarn.nodemanager.all-containers-cpu-percentage? I'm indifferent on adding all as a prefix. Something like yarn.nodemanager.containers-limit-cpu-percentage might make it more clear that this is a hard limit and CPUs can go idle even if containers are demanding more from the machine than this limit. Cgroups should allow YARN containers to be limited to allocated cores - Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, apache-yarn-2440.2.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml.
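For concreteness, a hedged sketch of how a percentage-style limit could translate into the CFS knobs named in this thread (cpu.cfs_quota_us / cpu.cfs_period_us); the cgroup mount path is illustrative and the class is a stand-in for what a cgroups resource handler would do, not the actual patch:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Convert a containers-cpu-percentage style limit into a CFS ceiling.
// With 8 cores and 75%: quota = 8 * 100000 * 75 / 100 = 600000us per
// 100000us period, i.e. at most 6 cores' worth of CPU in aggregate.
public class CpuCeilingSketch {
  static void setCpuCeiling(int numCores, int percentage) throws IOException {
    final long periodUs = 100000L; // CFS default period
    long quotaUs = numCores * periodUs * percentage / 100L;
    Path cgroup = Paths.get("/sys/fs/cgroup/cpu/hadoop-yarn"); // illustrative mount
    Files.write(cgroup.resolve("cpu.cfs_period_us"),
        Long.toString(periodUs).getBytes(StandardCharsets.UTF_8));
    Files.write(cgroup.resolve("cpu.cfs_quota_us"),
        Long.toString(quotaUs).getBytes(StandardCharsets.UTF_8));
  }
}
{code}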
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111520#comment-14111520 ] Carlo Curino commented on YARN-1707: Wangda, thanks for the great feedback. You spotted a bunch of oddities that were there due to previous versions of the reservation system but are not needed anymore; I think the updated version is definitely cleaner. We address 1, 2 by: * moving addQueue and removeQueue to PlanQueue (as they were only invoked on instances of the subclass). * making uniform checks from within the PlanQueue for capacity 0, and throwing a uniform SchedulerConfigEditException * fixing the log, and making the logs more uniform We address 3, 6 by: * merging addCapacity and subtractCapacity into a single changeCapacity * making the checks against the range limits [0, 1] (this reduced code both in CS and ReservationQueue... good call!) We address 4 by: * renaming getReservableQueues() to getPlanQueues() Regarding 5: ReservationQueue#getQueueName * This is the result of our previous conversations with Vinod, Bikas, and Arun. The idea is that the user should not be aware of the fact that we use queues to implement reservations, and thus shouldn't see the name of the reservation queue listed in the UI, but rather the name of the parent PlanQueue. More precisely, we have options for the UI to show or hide the subqueues, but this differentiation is needed here to allow that: getQueueName for a ReservationQueue returns the parent, while getReservationQueueName() returns the actual local name. Regarding 7: DynamicQueueConf * We currently only assign capacity dynamically, but you can imagine this being extended in the future to set many more parameters for a queue (user-limit factors, max applications, etc.). The conf-based mechanism is future-proofing against this. Regarding 8: ParentQueue#setChildQueues * I don't understand the comment. This check is automatically bypassed for PlanQueue (which by design has no children; see CapacityScheduler near line 562). We are testing the new version of the patch now, and will post a patch soon. Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Attachments: YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella jira YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues.
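A hedged sketch of what the merged changeCapacity with a [0, 1] range check might look like; PlanQueue, ReservationQueue, and SchedulerConfigEditException are named in the discussion above, while the setter is illustrative:
{code}
// Single entry point replacing addCapacity/subtractCapacity: set the new
// capacity directly and validate the range once, in one place.
synchronized void changeCapacity(ReservationQueue child, float newCapacity)
    throws SchedulerConfigEditException {
  if (newCapacity < 0f || newCapacity > 1f) {
    throw new SchedulerConfigEditException(
        "Capacity " + newCapacity + " for " + child.getQueueName()
        + " must be in the range [0, 1]");
  }
  child.setCapacity(newCapacity); // illustrative setter on the dynamic queue
}
{code}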
[jira] [Commented] (YARN-1990) Track time-to-allocation for different size containers
[ https://issues.apache.org/jira/browse/YARN-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111535#comment-14111535 ] Carlo Curino commented on YARN-1990: We had a simple implementation for this, using QueueMetrics and maintaining a map of delays (using SampleQuantile), tracking the start and end of the wait time by catching reserve() and unreserve() calls. In our test environment, it didn't seem to matter much. We might investigate further, and post patches after YARN-1051 is committed. Track time-to-allocation for different size containers --- Key: YARN-1990 URL: https://issues.apache.org/jira/browse/YARN-1990 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Allocation of large containers is notoriously problematic, as smaller containers can more easily grab resources. The proposal for this JIRA is to maintain a map of container sizes and time-to-allocation, that can be used as: * general insight on cluster behavior, * to inform the reservation-system, and allow us to account for delays in allocation, so that the user reservation is respected regardless of the size of containers requested.
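A hedged sketch of the reserve()/unreserve() timing idea; the size-bucket keying and the stats structure are assumptions (the comment mentions SampleQuantile, approximated here with summary statistics for brevity):
{code}
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Track time-to-allocation per container size: stamp the start on reserve()
// and record the elapsed time when the reservation is fulfilled. Keyed per
// size bucket for brevity; a real tracker would key on the reservation itself.
class AllocationLatencyTracker {
  private final Map<String, Long> pendingSince = new ConcurrentHashMap<>();
  private final Map<String, LongSummaryStatistics> latencies =
      new ConcurrentHashMap<>();

  void onReserve(String sizeBucket) {            // e.g. "8192mb-4vcores"
    pendingSince.putIfAbsent(sizeBucket, System.nanoTime());
  }

  void onAllocate(String sizeBucket) {
    Long start = pendingSince.remove(sizeBucket);
    if (start != null) {
      latencies.computeIfAbsent(sizeBucket, k -> new LongSummaryStatistics())
               .accept(System.nanoTime() - start);
    }
  }
}
{code}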
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-1707: --- Attachment: YARN-1707.4.patch
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111597#comment-14111597 ] Maysam Yabandeh commented on YARN-2405: --- I am thinking of a simple patch that catches the NPE and skips adding the record to appsTableData. Comments are highly appreciated. NPE in FairSchedulerAppsBlock (scheduler page) -- Key: YARN-2405 URL: https://issues.apache.org/jira/browse/YARN-2405 Project: Hadoop YARN Issue Type: Bug Reporter: Maysam Yabandeh FairSchedulerAppsBlock#render throws NPE at this line {code} int fairShare = fsinfo.getAppFairShare(attemptId); {code} This causes the scheduler page to not show the apps, since it lacks the definition of appsTableData {code} Uncaught ReferenceError: appsTableData is not defined {code} The problem is temporary, meaning it usually resolves itself either after a retry or after a few hours.
[jira] [Updated] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
[ https://issues.apache.org/jira/browse/YARN-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated YARN-2430: -- Assignee: Sandy Ryza (was: Maysam Yabandeh) FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Sandy Ryza The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load by a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient, since the number of compare invocations on each Comparable object is >= 1.
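A hedged sketch of both levels of caching described here: reusing the result within compare, and a periodically refreshed usage field on FSLeafQueue (the cached field, the refresh hook, and the collection name are assumptions; the comparison body is a placeholder):
{code}
// Level 1: call getResourceUsage() once per side instead of three times.
public int compare(Schedulable s1, Schedulable s2) {
  Resource usage1 = s1.getResourceUsage(); // single call, reused below
  Resource usage2 = s2.getResourceUsage();
  // ... fair-share comparison logic using usage1/usage2 ...
  return Integer.compare(usage1.getMemory(), usage2.getMemory()); // placeholder
}

// Level 2 (in FSLeafQueue): cache the sum over apps and refresh it on the
// scheduler's update cycle rather than recomputing it per comparison.
private Resource cachedUsage = Resources.none();

void refreshResourceUsage() {                 // assumed to run in update()
  Resource sum = Resources.createResource(0, 0);
  for (AppSchedulable app : runnableApps) {   // collection name assumed
    Resources.addTo(sum, app.getResourceUsage());
  }
  cachedUsage = sum;
}

@Override
public Resource getResourceUsage() {
  return cachedUsage;
}
{code}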
[jira] [Commented] (YARN-1506) Replace set resource change on RMNode/SchedulerNode directly with event notification.
[ https://issues.apache.org/jira/browse/YARN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111610#comment-14111610 ] Jian He commented on YARN-1506: --- Junping, thanks for the update. - Move the following to the previous else condition where the update is performed successfully {code} RMAuditLogger.logSuccess(user.getShortUserName(), argName, AdminService); {code} - Update the node’s capacity only if the capacity changes? And we may directly send NodeResourceUpdateSchedulerEvent here, instead of making the node send an event to itself {code} // Update node's capacity for reconnect node. rmNode.context.getDispatcher().getEventHandler().handle( new RMNodeResourceUpdateEvent(rmNode.nodeId, ResourceOption.newInstance(rmNode.totalCapability, -1))); {code} - maybe nodeAndQueueResourceUpdate -> updateNodeAndQueueResource, and similarly nodeResourceUpdate -> updateNodeResource - given that we put the common method in AbstractYarnScheduler already, we can move SchedulerUtils.updateResourceOnSchedulerNode to AbstractYarnScheduler#nodeResourceUpdate also. - FiCaSchedulerNode: import-only changes, we can revert. - testResourceOverCommit in CapacityScheduler and FifoScheduler are almost the same. I think we can create a new test file and use parameterized tests for all scheduler types. - In AdminService: we may updateNodeResource only if the node resource changes? - In FairScheduler, I think we should do the following too when updating resources, as in FairScheduler#addNode(): {code} updateRootQueueMetrics(); queueMgr.getRootQueue().setSteadyFairShare(clusterResource); queueMgr.getRootQueue().recomputeSteadyShares(); {code} Replace set resource change on RMNode/SchedulerNode directly with event notification. - Key: YARN-1506 URL: https://issues.apache.org/jira/browse/YARN-1506 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, scheduler Reporter: Junping Du Assignee: Junping Du Attachments: YARN-1506-v1.patch, YARN-1506-v10.patch, YARN-1506-v11.patch, YARN-1506-v12.patch, YARN-1506-v13.patch, YARN-1506-v2.patch, YARN-1506-v3.patch, YARN-1506-v4.patch, YARN-1506-v5.patch, YARN-1506-v6.patch, YARN-1506-v7.patch, YARN-1506-v8.patch, YARN-1506-v9.patch According to Vinod's comments on YARN-312 (https://issues.apache.org/jira/browse/YARN-312?focusedCommentId=13846087&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13846087), we should replace RMNode.setResourceOption() with some resource change event.
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111624#comment-14111624 ] Jian He commented on YARN-2456: --- One thing we can do is add applications to the scheduler in application submission order, i.e. sort the apps by applicationId before recovering them.
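A hedged sketch of that ordering idea; ApplicationId is genuinely Comparable and ids grow with submission order, but the recoveredApps map and the recoverApplication helper are hypothetical stand-ins for the RM's recovery path:
{code}
// Recover apps in submission order so App1 is re-added (and activated)
// before App2, avoiding the inversion described in the scenario above.
List<ApplicationId> appIds = new ArrayList<ApplicationId>(recoveredApps.keySet()); // hypothetical map
Collections.sort(appIds); // ApplicationId implements Comparable<ApplicationId>
for (ApplicationId appId : appIds) {
  recoverApplication(recoveredApps.get(appId)); // hypothetical helper
}
{code}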
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111710#comment-14111710 ] Tsuyoshi OZAWA commented on YARN-2405: -- Hi [~maysamyabandeh], could you tell us the version in which you faced this problem?
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111713#comment-14111713 ] Anubhav Dhoot commented on YARN-1372: - bq. I meant, is it possible for an NM in the DECOMMISSIONED/LOST state to receive the newly added CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle them too. Fixed that. bq. the same justFinishedContainers set can be used to return to AM and ack NMs? There are 3 states for completed containers in this set: a) container added to justFinishedContainers but not yet sent to the AM; b) container sent to the AM in a previous allocateResponse but not yet acked; c) the next allocate call from the AM has happened after the container was sent, which implicitly acks it from the AM's point of view, so it can now be sent to the NM. Instead of keeping additional state to track a) and b), I used 2 collections, justFinishedContainers and previousJustFinishedContainers respectively. I have added tests to show that. bq. I meant can we remove all the containers in NMContext for the application once we received the NodeHeartbeatResponse#getApplicationsToCleanup notification, instead of depending on expiration. I tried doing that but hit one issue: ApplicationImpl, which has the mapping of application to containers, cannot access the event dispatcher for ContainerManagerImpl (which is the one removing the containers from the context). I am going to upload a patch that removes the dispatcher local to ContainerManagerImpl (~/patches/YARN-1372.002_NMHandlesCompletedApp.patch). I also looked into an alternate approach where the RM acks the completed containers that belong to an app that has completed. I am uploading that patch as well (~/patches/YARN-1372.002_RMHandlesCompletedApp.patch)
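A hedged sketch of the two-collection scheme described above: each allocate call acks the previous batch to the NMs and ships the current batch to the AM (the collection names follow the comment; the NM-ack helper is hypothetical):
{code}
// On each AM allocate(): containers sent in the previous response are now
// implicitly acked (state c), so notify the NMs and rotate the collections.
private List<ContainerStatus> justFinishedContainers = new ArrayList<>();
private List<ContainerStatus> previousJustFinishedContainers = new ArrayList<>();

synchronized List<ContainerStatus> pullJustFinishedContainers() {
  // State c: the previous batch was delivered and is acked by this call.
  ackCompletedContainersToNMs(previousJustFinishedContainers); // hypothetical
  // State a -> b: the current batch goes out with this response.
  previousJustFinishedContainers = justFinishedContainers;
  justFinishedContainers = new ArrayList<>();
  return new ArrayList<>(previousJustFinishedContainers);
}
{code}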
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.002_RMHandlesCompletedApp.patch Addresses feedback by having the RM ack completed containers for a completed app.
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.002_NMHandlesCompletedApp.patch Addresses feedback by having the NM remove containers for a completed app from its context.
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111718#comment-14111718 ] Hadoop QA commented on YARN-1372: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12664541/YARN-1372.002_RMHandlesCompletedApp.patch against trunk revision . {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4738//console This message is automatically generated.
[jira] [Created] (YARN-2457) FairScheduler: Handle preemption to help starved parent queues
Karthik Kambatla created YARN-2457: -- Summary: FairScheduler: Handle preemption to help starved parent queues Key: YARN-2457 URL: https://issues.apache.org/jira/browse/YARN-2457 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla YARN-2395/YARN-2394 add preemption timeout and threshold per queue, but don't check for parent queue starvation. We need to check that.
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111722#comment-14111722 ] Karthik Kambatla commented on YARN-2395: I see the issue now. Thanks for catching it, Ashwin. Wei and I discussed this offline to see what might be the best way to handle it, and here is what I think might work: # Starved leaf queues will continue to be handled the way they are in the latest patch. # YARN-2154 changes the behavior for leaf queues to look at actual ResourceRequests and preempt only matching containers. # YARN-2457: For each starved parent queue, pick the application with positive demand that is least disadvantaged in terms of allocation, and preempt containers that match it. I propose we get this in and follow up on the other items in their respective JIRAs.
[jira] [Commented] (YARN-2154) FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111724#comment-14111724 ] Karthik Kambatla commented on YARN-2154: Started looking into this. Today, we just look at the amount of resources to be preempted. Instead, we should collect a list of applications for which we are preempting containers, then iterate through these applications and their ResourceRequests to find potential matches, first in free resources and subsequently in resources assigned to other applications that are over their fair share. Will post a patch for this once YARN-2395 and YARN-2394 get committed. FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request -- Key: YARN-2154 URL: https://issues.apache.org/jira/browse/YARN-2154 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Today, FairScheduler uses a spray-gun approach to preemption. Instead, it should only preempt resources that would satisfy the incoming request.
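A hedged sketch of request-matched preemption along those lines: collect the starved apps, then free only containers that could actually satisfy one of their outstanding requests (the helper methods are illustrative; Resources.fitsIn is a real utility):
{code}
// Preempt only containers whose size can satisfy a pending ResourceRequest
// of a starved application, instead of the spray-gun approach.
void preemptForStarvedApps(List<FSAppAttempt> starvedApps) {
  for (FSAppAttempt starved : starvedApps) {
    for (ResourceRequest request : starved.getPendingRequests()) {  // illustrative
      for (RMContainer candidate : containersOverFairShare()) {     // illustrative
        if (Resources.fitsIn(request.getCapability(),
                             candidate.getContainer().getResource())) {
          preemptContainer(candidate);                              // illustrative
          break; // one matching container per request in this sketch
        }
      }
    }
  }
}
{code}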
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111725#comment-14111725 ] Maysam Yabandeh commented on YARN-2405: --- [~ozawa], we got the error in a fork of 2.0.5, but further code inspection showed that the problem also exists in 2.5.
[jira] [Commented] (YARN-2385) Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111737#comment-14111737 ] Subramaniam Krishnan commented on YARN-2385: [~sunilg], the behavior of *getAppsInQueue* should be the same for both CS and FS, unless I am missing something. As part of YARN-2378, I added pending apps also to CS#getAppsInQueue. Consider splitting getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler Currently getAppsinQueue returns both pending and running apps. The purpose of the JIRA is to explore splitting it into getRunningAppsInQueue + getPendingAppsInQueue, which would provide more flexibility to callers.
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111740#comment-14111740 ] hex108 commented on YARN-2405: -- If an RMApp has not been accepted by the scheduler, it will only be recorded in `Map<ApplicationId, RMApp> rmContext.getRMApps()`. So I think we could first test whether it is in `Map<ApplicationId, SchedulerApplication> applications`, and then decide whether to get its fair share. Is that OK?
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111747#comment-14111747 ] Karthik Kambatla commented on YARN-2395: Latest patch looks good to me. +1 [~ashwinshankar77] - does the plan and the current patch look alright to you?
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111757#comment-14111757 ] Tsuyoshi OZAWA commented on YARN-2405: -- [~maysamyabandeh], [~hex108], I see the problem now. I think the trunk code still has the issue. Let me tackle this.
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111758#comment-14111758 ] Maysam Yabandeh commented on YARN-2405: --- Sounds good to me. We also need to decide how to react to a nonexistent app: return a fair share of 0, -1, or skip the whole record from appsTableData? If the problematic record is going to be skipped, then instead of putting checks inside the fair share computation, we can alternatively catch the NPE at FairSchedulerAppsBlock.
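A hedged sketch of the catch-and-skip option at the render site; the loop shape inside FairSchedulerAppsBlock#render is simplified, and the row-appending helper is illustrative (appsTableData is the JS array named in the description):
{code}
// Skip apps the scheduler no longer knows about instead of letting the NPE
// abort render(), which leaves appsTableData undefined on the page.
for (RMApp app : apps) {
  ApplicationAttemptId attemptId =
      app.getCurrentAppAttempt().getAppAttemptId();
  int fairShare;
  try {
    fairShare = fsinfo.getAppFairShare(attemptId);
  } catch (NullPointerException e) {
    // App not yet (or no longer) tracked by the scheduler: drop this row.
    continue;
  }
  appendAppRow(appsTableData, app, fairShare); // illustrative helper
}
{code}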
[jira] [Commented] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111770#comment-14111770 ] Maysam Yabandeh commented on YARN-2405: --- [~ozawa], all right then. Looking forward to your patch.
[jira] [Assigned] (YARN-2405) NPE in FairSchedulerAppsBlock (scheduler page)
[ https://issues.apache.org/jira/browse/YARN-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA reassigned YARN-2405: Assignee: Tsuyoshi OZAWA
[jira] [Assigned] (YARN-2454) The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined incorrectly.
[ https://issues.apache.org/jira/browse/YARN-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Yang reassigned YARN-2454: - Assignee: Xu Yang The function compareTo of variable UNBOUNDED in org.apache.hadoop.yarn.util.resource.Resources is defined incorrectly. -- Key: YARN-2454 URL: https://issues.apache.org/jira/browse/YARN-2454 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.4.1 Reporter: Xu Yang Assignee: Xu Yang Attachments: YARN-2454 -v2.patch, YARN-2454-patch.diff, YARN-2454.patch The variable UNBOUNDED implements the abstract class Resources and overrides the function compareTo. But there is something wrong in this function: UNBOUNDED should not compare resources against zero the same way the variable NONE does. We should change the 0 to Integer.MAX_VALUE.
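A hedged sketch of the proposed fix, following the pattern of the NONE constant in the same class (the surrounding boilerplate of the anonymous Resource subclass is abbreviated):
{code}
// UNBOUNDED should sort above every real resource, so compare using
// Integer.MAX_VALUE rather than 0 (which made it behave like NONE).
private static final Resource UNBOUNDED = new Resource() {
  @Override public int getMemory() { return Integer.MAX_VALUE; }
  @Override public int getVirtualCores() { return Integer.MAX_VALUE; }
  @Override public void setMemory(int memory) {
    throw new RuntimeException("UNBOUNDED cannot be modified!");
  }
  @Override public void setVirtualCores(int cores) {
    throw new RuntimeException("UNBOUNDED cannot be modified!");
  }
  @Override
  public int compareTo(Resource o) {
    int diff = Integer.MAX_VALUE - o.getMemory();     // was: 0 - o.getMemory()
    if (diff == 0) {
      diff = Integer.MAX_VALUE - o.getVirtualCores(); // was: 0 - o.getVirtualCores()
    }
    return diff;
  }
};
{code}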