[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval

2014-05-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2054:
---

Attachment: yarn-2054-3.patch

Patch with unit test.

 Poor defaults for YARN ZK configs for retries and retry-inteval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch


 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1000 seconds before the RM gives up trying to 
 connect to ZK. 
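
 As an illustrative aside (not part of the original report), the arithmetic behind that number can be checked with the plain Hadoop Configuration API; the property names are the ones quoted above:
 {code}
 import org.apache.hadoop.conf.Configuration;

 public class ZkRetryBudget {
   public static void main(String[] args) {
     Configuration conf = new Configuration();
     // Defaults quoted in this issue: 500 retries, 2000 ms apart.
     int numRetries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);
     long intervalMs = conf.getLong("yarn.resourcemanager.zk-retry-interval-ms", 2000);
     // 500 * 2000 ms = 1,000,000 ms, i.e. roughly 1000 seconds before the RM gives up.
     System.out.println("RM gives up after ~" + (numRetries * intervalMs) / 1000 + " seconds");
   }
 }
 {code}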



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010854#comment-14010854
 ] 

Karthik Kambatla commented on YARN-2010:


bq. In non-work-preserving restart, since the old attempt will be essentially 
killed on RM restart, a new attempt will be automatically started and it will 
have a new clientTokenMasterKey generated. So we may not need to fail this 
app.
The stack trace corresponds to non-work-preserving restart. I am not sure I 
understand the concern.

This JIRA addresses all cases where an app recovery fails. Examples include 
token issues, queue ACL changes that disallow the user from submitting to the 
queue, etc. In any of these cases, users should have the option of continuing 
to start the RM.
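
To make the intent concrete, here is a rough, hypothetical sketch of the behaviour being argued for; the flag name, helper interface, and method names below are illustrative placeholders, not code from any attached patch:

{code}
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: skip (or fail on) applications whose recovery throws,
// depending on an operator-controlled flag, so the RM can still become active.
public class LenientRecoverySketch {
  private static final Log LOG = LogFactory.getLog(LenientRecoverySketch.class);
  // Placeholder key, not an actual YARN configuration property.
  static final String FAIL_FAST_KEY = "yarn.resourcemanager.recovery.fail-fast";

  interface AppRecoverer {
    void recover(String appId) throws Exception;
  }

  static void recoverAll(List<String> appIds, AppRecoverer recoverer, Configuration conf) {
    boolean failFast = conf.getBoolean(FAIL_FAST_KEY, true);
    for (String appId : appIds) {
      try {
        recoverer.recover(appId);  // token or queue-ACL problems would surface here
      } catch (Exception e) {
        if (failFast) {
          throw new RuntimeException("Failed to recover " + appId, e);
        }
        LOG.warn("Skipping unrecoverable application " + appId, e);  // RM keeps starting
      }
    }
  }
}
{code}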

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
 at 
 org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
  
 at 
 

[jira] [Commented] (YARN-2092) Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 2.5.0-SNAPSHOT

2014-05-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010906#comment-14010906
 ] 

Steve Loughran commented on YARN-2092:
--

# Code using Apache Curator was one example: in YARN, and on the client if you 
picked up the locally installed Hadoop CP. And we couldn't upload a newer 
version of Jackson because, again, what's on the CP is what you get.
# If, client-side, you abandon that CP and use/redistribute your entire Hadoop 
binary set, you can reduce the risk here, at the expense of taking away from 
ops any control over the versions of things you run on the cluster; but then 
you have to deal with older files in the cluster.
# Or you ignore yarn.lib.classpath entirely, *somehow* work out the values of 
yarn-site.xml etc., and re-upload every single hadoop-*.jar and its chosen 
binaries into every single container. Having a YARN artifact repo will reduce 
the cost of that, but adds a new one: bug fixes in Hadoop will only propagate 
when the apps are rebuilt.
# ...if you look at the HADOOP-9991 issue you can see links to some places 
where the outdated JARs in Hadoop cause problems for other ASF projects.
# Tez appears to have broken because it was explicitly putting the 1.8.x JARs 
on its list of binaries to upload. It only worked before because it was using 
exactly the same version.
# If you adopt a policy of "change no dependencies that break apps that upload 
duplicate JARs to the CP", then this goes beyond Jackson: it says Hadoop cannot 
update any of its dependencies. That would go for 2.x, and no doubt even if we 
did update things for 3.x, we'd still get "you broke my code that uploaded 
Jackson 1.8" issues.
# ...and we haven't gone near Guava yet, which is frozen because it really is 
so brittle, but that means we can't pick up the Guava 16.x-only fixes needed to 
work with the latest JVMs.

If you do want to revert the Jackson upgrade, I'm not going to veto it, but it 
goes beyond YARN, and we may as well revert every single HADOOP-9991-related 
upgrade.

 Incompatible org.codehaus.jackson* dependencies when moving from 2.4.0 to 
 2.5.0-SNAPSHOT
 

 Key: YARN-2092
 URL: https://issues.apache.org/jira/browse/YARN-2092
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Hitesh Shah

 Came across this when trying to integrate with the timeline server. Using a 
 1.8.8 dependency of jackson works fine against 2.4.0 but fails against 
 2.5.0-SNAPSHOT which needs 1.9.13. This is in the scenario where the user 
 jars are first in the classpath.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010998#comment-14010998
 ] 

Hadoop QA commented on YARN-2054:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647073/yarn-2054-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 27 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-common-project/hadoop-nfs 
hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs 
hadoop-tools/hadoop-distcp hadoop-tools/hadoop-sls 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests:

  
org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
  org.apache.hadoop.yarn.client.TestRMAdminCLI
  
org.apache.hadoop.hdfs.server.namenode.TestSecondaryNameNodeUpgrade

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3843//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3843//console

This message is automatically generated.

 Poor defaults for YARN ZK configs for retries and retry-inteval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch


 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1000 seconds before the RM gives up trying to 
 connect to ZK. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011021#comment-14011021
 ] 

Hudson commented on YARN-2105:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #566 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/566/])
YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy 
Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


 Fix TestFairScheduler after YARN-2012
 -

 Key: YARN-2105
 URL: https://issues.apache.org/jira/browse/YARN-2105
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Ashwin Shankar
 Fix For: 2.5.0

 Attachments: YARN-2105-v1.txt


 The following tests fail in trunk:
 {code}
 Failed tests:
   TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0>
 Tests in error:
   TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer
   TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011022#comment-14011022
 ] 

Hudson commented on YARN-2012:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #566 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/566/])
YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy 
Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


 Fair Scheduler: allow default queue placement rule to take an arbitrary queue
 -

 Key: YARN-2012
 URL: https://issues.apache.org/jira/browse/YARN-2012
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Fix For: 2.5.0

 Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt


 Currently the 'default' rule in the queue placement policy, if applied, puts 
 the app in the root.default queue. It would be great if we could make the 
 'default' rule optionally point to a different queue as the default queue.
 This default queue can be a leaf queue, or it can also be a parent queue if 
 the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011103#comment-14011103
 ] 

Hudson commented on YARN-2012:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1757 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1757/])
YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy 
Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


 Fair Scheduler: allow default queue placement rule to take an arbitrary queue
 -

 Key: YARN-2012
 URL: https://issues.apache.org/jira/browse/YARN-2012
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Fix For: 2.5.0

 Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt


 Currently the 'default' rule in the queue placement policy, if applied, puts 
 the app in the root.default queue. It would be great if we could make the 
 'default' rule optionally point to a different queue as the default queue.
 This default queue can be a leaf queue, or it can also be a parent queue if 
 the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2105) Fix TestFairScheduler after YARN-2012

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011182#comment-14011182
 ] 

Hudson commented on YARN-2105:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1784 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1784/])
YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy 
Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


 Fix TestFairScheduler after YARN-2012
 -

 Key: YARN-2105
 URL: https://issues.apache.org/jira/browse/YARN-2105
 Project: Hadoop YARN
  Issue Type: Test
Reporter: Ted Yu
Assignee: Ashwin Shankar
 Fix For: 2.5.0

 Attachments: YARN-2105-v1.txt


 The following tests fail in trunk:
 {code}
 Failed tests:
   TestFairScheduler.testDontAllowUndeclaredPools:2412 expected:<1> but was:<0>
 Tests in error:
   TestFairScheduler.testQueuePlacementWithPolicy:624 NullPointer
   TestFairScheduler.testNotUserAsDefaultQueue:530 » NullPointer
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2012) Fair Scheduler: allow default queue placement rule to take an arbitrary queue

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011183#comment-14011183
 ] 

Hudson commented on YARN-2012:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1784 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1784/])
YARN-2105. Fix TestFairScheduler after YARN-2012. (Ashwin Shankar via Sandy 
Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1597902)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueuePlacementRule.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java


 Fair Scheduler: allow default queue placement rule to take an arbitrary queue
 -

 Key: YARN-2012
 URL: https://issues.apache.org/jira/browse/YARN-2012
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Fix For: 2.5.0

 Attachments: YARN-2012-v1.txt, YARN-2012-v2.txt, YARN-2012-v3.txt


 Currently the 'default' rule in the queue placement policy, if applied, puts 
 the app in the root.default queue. It would be great if we could make the 
 'default' rule optionally point to a different queue as the default queue.
 This default queue can be a leaf queue, or it can also be a parent queue if 
 the 'default' rule is nested inside the nestedUserQueue rule (YARN-1864).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1338) Recover localized resource cache state upon nodemanager restart

2014-05-28 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-1338:
-

Attachment: YARN-1338v6.patch

Thanks for the additional comments, Junping.

bq. Do we have any code to destroy DB items for NMState when NM is 
decommissioned (not expecting short-term restart)?

Good point.  I added shutdown code that removes the recovery directory if the 
shutdown is due to a decommission.  I also added a unit test for this scenario.

{quote}
In LocalResourcesTrackerImpl#recoverResource()

+incrementFileCountForLocalCacheDirectory(localDir.getParent());

Given localDir is already the parent of localPath, maybe we should just 
increment localDir rather than its parent? I didn't see that we have a unit 
test to check the file count for the resource directory after recovery. Maybe 
we should add some?
{quote}

The last component of localDir is the unique resource ID and not a directory 
managed by the local cache directory manager.  The directory allocated by the 
local cache directory manager has an additional directory added by the 
localization process, which is named after the unique ID for the local 
resource.  For example, the localPath might be something like 
/local/root/0/1/52/resource.jar and localDir is /local/root/0/1/52.  The '52' 
is the unique resource ID (always >= 10 so it can't conflict with 
single-character cache mgr subdirs) and /local/root/0/1 is the directory 
managed by the local dir cache manager.  If we passed localDir to the local dir 
cache manager it would get confused, since it would try to parse the last 
component as a subdirectory it created, but it isn't one.

I did add a unit test to verify local cache directory counts are incremented 
properly when resources are recovered.  This required exposing a couple of 
methods as package-private to get the necessary information for the test.
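
To illustrate the layout described above with the same example paths (this is not code from the patch, just a restatement of the comment):

{code}
import org.apache.hadoop.fs.Path;

public class LocalDirLayoutExample {
  public static void main(String[] args) {
    // Example from the comment: "52" is the unique resource ID appended by localization.
    Path localPath = new Path("/local/root/0/1/52/resource.jar");
    Path localDir  = localPath.getParent();  // /local/root/0/1/52 (resource ID dir)
    Path cacheDir  = localDir.getParent();   // /local/root/0/1 (managed by the cache dir manager)
    // Hence incrementFileCountForLocalCacheDirectory(localDir.getParent()) in the patch.
    System.out.println("increment file count for: " + cacheDir);
  }
}
{code}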

 Recover localized resource cache state upon nodemanager restart
 ---

 Key: YARN-1338
 URL: https://issues.apache.org/jira/browse/YARN-1338
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1338.patch, YARN-1338v2.patch, 
 YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, 
 YARN-1338v6.patch


 Today when the node manager restarts, we clean up all the distributed cache 
 files from disk. This is definitely not ideal from 2 aspects.
 * For work-preserving restart we definitely want them, as running containers 
 are using them.
 * Even for non-work-preserving restart this will be useful, in the sense that 
 we don't have to download them again if needed by future tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler

2014-05-28 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2109:
---

 Summary: TestRM fails some tests when some tests run with 
CapacityScheduler and some with FairScheduler
 Key: YARN-2109
 URL: https://issues.apache.org/jira/browse/YARN-2109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Anubhav Dhoot


testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in 
[YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set 
it to CapacityScheduler. But if the default scheduler is set to FairScheduler, 
then the rest of the tests that execute after this one will fail with invalid 
cast exceptions when getting queue metrics. This depends on test execution 
order, as only the tests that execute after this test will fail. This is 
because the queue metrics will be initialized by this test to QueueMetrics and 
shared by the subsequent tests.

We can explicitly clear the metrics at the end of this test to fix this (a 
sketch follows the stack trace below). For example:

java.lang.ClassCastException: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot be 
cast to 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:90)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:85)
at 
org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:81)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)
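
A minimal sketch of the suggested cleanup, assuming JUnit 4 and the test-only helpers QueueMetrics.clearQueueMetrics() and DefaultMetricsSystem.shutdown(); treat the exact calls as assumptions rather than a committed fix:

{code}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
import org.junit.After;

public class TestRMMetricsCleanupSketch {
  // Reset shared metrics state after each test so a CapacityScheduler-only test
  // does not leak QueueMetrics instances into later FairScheduler-based tests.
  @After
  public void clearSchedulerMetrics() {
    QueueMetrics.clearQueueMetrics();
    DefaultMetricsSystem.shutdown();
  }
}
{code}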




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-28 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011320#comment-14011320
 ] 

Tsuyoshi OZAWA commented on YARN-1474:
--

Thanks Karthik for the comments. I'd like to make sure one point:

{quote}
1. In each of the schedulers, I don't think we need the following snippet or 
for that matter the variable initialized at all. reinitialize() would have just 
the contents of else-block.
{quote}

If we change {{reinitialize()}} to have just the contents of the else-block, we 
need to change lots of scheduler-related test cases that don't go through 
ResourceManager/MockRM to call {{scheduler.init()}} right after 
{{scheduler.setRMContext()}}. Is that acceptable for us?

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, 
 YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1338) Recover localized resource cache state upon nodemanager restart

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011325#comment-14011325
 ] 

Hadoop QA commented on YARN-1338:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647161/YARN-1338v6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3844//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3844//console

This message is automatically generated.

 Recover localized resource cache state upon nodemanager restart
 ---

 Key: YARN-1338
 URL: https://issues.apache.org/jira/browse/YARN-1338
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-1338.patch, YARN-1338v2.patch, 
 YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, 
 YARN-1338v6.patch


 Today when the node manager restarts, we clean up all the distributed cache 
 files from disk. This is definitely not ideal from 2 aspects.
 * For work-preserving restart we definitely want them, as running containers 
 are using them.
 * Even for non-work-preserving restart this will be useful, in the sense that 
 we don't have to download them again if needed by future tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011331#comment-14011331
 ] 

Karthik Kambatla commented on YARN-1474:


I think that is a step in the right direction. I agree it is a change in 
semantics. It might be a good idea to see what others think.

[~sandyr], [~vinodkv] - do you guys think it is okay to change the semantics on 
how a scheduler is used:
- Before this patch, we create a scheduler and call reinitialize().
- After this patch, I am proposing scheduler.setRMContext(), scheduler.init(), 
and then scheduler.reinitialize() for later updates to allocation-files etc.

Scheduler initialization is within the RM, and we haven't exposed the scheduler 
API for users to write custom schedulers yet. 
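
In code form, the proposed lifecycle looks roughly like the sketch below; it assumes the post-patch interface where the scheduler is a YARN Service with a setRMContext() method, so treat the exact method set as an assumption:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;

public class SchedulerLifecycleSketch {
  // Before the patch: the RM created the scheduler and called reinitialize() directly.
  // After the patch (as proposed above):
  static void bringUp(ResourceScheduler scheduler, Configuration conf, RMContext ctx)
      throws Exception {
    scheduler.setRMContext(ctx);        // 1. wire in the RM context
    scheduler.init(conf);               // 2. Service init
    scheduler.start();                  // 3. Service start
    // Later updates, e.g. when allocation files change:
    scheduler.reinitialize(conf, ctx);
  }
}
{code}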

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, 
 YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-596:
-

Attachment: YARN-596.patch

Thanks, Sandy. Uploaded a new patch to address your comments.

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval

2014-05-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011335#comment-14011335
 ] 

Xuan Gong commented on YARN-2054:
-

[~kasha] Looks like you need to update the patch. There are lots of unrelated 
changes.

 Poor defaults for YARN ZK configs for retries and retry-inteval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch


 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1000 seconds before the RM gives up trying to 
 connect to ZK. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-596:
-

Attachment: YARN-596.patch

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011355#comment-14011355
 ] 

Jian He commented on YARN-2010:
---

bq. The stack trace corresponds to non-work-preserving restart. I am not sure I 
understand the concern.
What I meant is, in this scenario, it shouldn't matter whether the old attempt 
has the master key or not, since the old attempt will be killed by the NM 
anyway on RM restart. The newly started attempt will have the proper master key 
generated. If we just check whether the key is null and move on, the next 
attempt should be able to succeed. So we don't need to explicitly fail the app?

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
 at 
 org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
  
 ... 13 more 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011362#comment-14011362
 ] 

Karthik Kambatla commented on YARN-2010:


I see. Thanks for the input. Let me check if that is indeed the case, and 
attempt recovering the app even if the key is null.

Regardless, do we agree that we still need to address the case where the app 
recovery fails for potentially other reasons?

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93) 
 at 
 org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
  
 ... 13 more 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2107) Refactor timeline classes into server.timeline package

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011429#comment-14011429
 ] 

Hudson commented on YARN-2107:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5616 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5616/])
YARN-2107. Refactored timeline classes into o.a.h.y.s.timeline package. 
Contributed by Vinod Kumar Vavilapalli. (zjshen: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598094)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/timeline
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/AHSWebApp.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/EntityIdentifier.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/GenericObjectMapper.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/NameValuePair.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineReader.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineWriter.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/package-info.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilter.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineAuthenticationFilterInitializer.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineClientAuthenticationService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineDelegationTokenSecretManagerService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp
* 

[jira] [Assigned] (YARN-2098) App priority support in Fair Scheduler

2014-05-28 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan reassigned YARN-2098:
-

Assignee: Wei Yan

 App priority support in Fair Scheduler
 --

 Key: YARN-2098
 URL: https://issues.apache.org/jira/browse/YARN-2098
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.5.0
Reporter: Ashwin Shankar
Assignee: Wei Yan

 This JIRA is created for supporting app priorities in the Fair Scheduler. 
 AppSchedulable hard-codes the priority of apps to 1; we should
 change this to get the priority from ApplicationSubmissionContext.
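
 As a rough illustration of the requested change (the surrounding names are assumptions, not code from a patch), the priority would come from the submission context instead of a constant:
 {code}
 import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
 import org.apache.hadoop.yarn.api.records.Priority;

 public class AppPrioritySketch {
   // Today AppSchedulable effectively uses a constant priority of 1 for every app.
   static Priority priorityFor(ApplicationSubmissionContext ctx) {
     Priority requested = ctx.getPriority();  // what this JIRA proposes to honour
     return requested != null ? requested : Priority.newInstance(1);  // fall back to the old value
   }
 }
 {code}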



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Anubhav Dhoot (JIRA)
Anubhav Dhoot created YARN-2110:
---

 Summary: TestAMRestart#testAMRestartWithExistingContainers assumes 
CapacityScheduler
 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: The TestAMRestart#testAMRestartWithExistingContainers 
does a cast to CapacityScheduler in a couple of places
{code}
((CapacityScheduler) rm1.getResourceScheduler())
{code}

If run with FairScheduler as default scheduler the test throws 
{code} java.lang.ClassCastException {code}.
Reporter: Anubhav Dhoot






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-2110:


Description: 
The TestAMRestart#testAMRestartWithExistingContainers does a cast to 
CapacityScheduler in a couple of places
{code}
((CapacityScheduler) rm1.getResourceScheduler())
{code}

If run with FairScheduler as default scheduler the test throws 
{code} java.lang.ClassCastException {code}.
Environment: (was: The 
TestAMRestart#testAMRestartWithExistingContainers does a cast to 
CapacityScheduler in a couple of places
{code}
((CapacityScheduler) rm1.getResourceScheduler())
{code}

If run with FairScheduler as default scheduler the test throws 
{code} java.lang.ClassCastException {code}.)

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot

 The TestAMRestart#testAMRestartWithExistingContainers does a cast to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as default scheduler the test throws 
 {code} java.lang.ClassCastException {code}.
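
 One way to make the test independent of the site-wide default (sketched here as a suggestion, not a patch) is to pin the scheduler class in the test configuration before constructing the RM, mirroring what YARN-1846 did for testNMTokenSentForNormalContainer:
 {code}
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
 import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

 public class PinCapacitySchedulerSketch {
   static YarnConfiguration capacitySchedulerConf() {
     YarnConfiguration conf = new YarnConfiguration();
     // Force CapacityScheduler regardless of the cluster default so the
     // (CapacityScheduler) casts in the test cannot hit ClassCastException.
     conf.setClass(YarnConfiguration.RM_SCHEDULER,
         CapacityScheduler.class, ResourceScheduler.class);
     return conf;
   }
 }
 {code}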



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011507#comment-14011507
 ] 

Sandy Ryza commented on YARN-1474:
--

My opinion is that it's ok to change these semantics, as ResourceScheduler is 
marked Evolving.  Given the complexity of writing a YARN scheduler, I also 
seriously doubt that there are custom ones out there outside of academic 
contexts, so I'm comfortable erring on the opposite side of caution. 

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, 
 YARN-1474.17.patch, YARN-1474.2.patch, YARN-1474.3.patch, YARN-1474.4.patch, 
 YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, YARN-1474.8.patch, 
 YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-2068) FairScheduler uses the same ResourceCalculator for all policies

2014-05-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved YARN-2068.
--

Resolution: Invalid

Closing this as invalid.  Obviously feel free to reopen if I'm missing 
something.

 FairScheduler uses the same ResourceCalculator for all policies
 ---

 Key: YARN-2068
 URL: https://issues.apache.org/jira/browse/YARN-2068
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 FairScheduler uses the same ResourceCalculator for all policies including 
 DRF. Need to fix that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor

2014-05-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011515#comment-14011515
 ] 

Vinod Kumar Vavilapalli commented on YARN-1972:
---

Thanks for working on this Remus. Can you upload the short design?

Questions/comments on the approach and the patch in the meanwhile:

h4. Approach
 - What are the requirements on the NodeManager user? Can it run as a regular 
'yarn' user, spawn the winutils shell and automatically launch tasks as some 
other user? Is there any admin setup needed for this, say to grant such 
privileges to the 'yarn' user?
 - One reason why we resorted to duplicating most of the code in 
DefaultContainerExecutor in container-executor.c for Linux is performance. You 
are launching so many commands for every container - to chown files, to copy 
files, etc. You should measure the performance impact of this to figure out 
whether what the patch does is fine or whether we should imitate what the 
Linux executor does.

h4. Patch
WindowsSecureContainerExecutor
 - The overridden getRunCommand skips things like the niceness-setting feature 
(YARN-443) on linux. Arguably this wasn't working in non-secure mode before 
anyway. Is there a way we can bump process priority on windows? If so, when we 
add that feature, we'll need to be careful to change both the default and the 
secure Executor.
 - namenodeGroup -> nodeManagerGroup
 - The division of responsibility between launching multiple commands before 
starting the localizer and the stuff that happens inside the localizer: the 
localizer already does createUserLocalDirs etc., so you don't need to do them 
explicitly in the Java code inside the NodeManager process.
 - At a minimum we should definitely move the exec.localizeClasspathJar() 
related stuff into the winutils start-process code.
 - Why is appLocalizationCounter needed? Once we tackle container-preserving 
NM-restart (YARN-1336), this will be an issue. Why cannot we simply use the 
localizerId? That is unique enough if we want uniqueness.
 - Also the startLocalizer() method is a near clone of what exists in 
LinuxContainerExecutor. We should refactor and reuse, otherwise it will be a 
maintenance headache.

 Implement secure Windows Container Executor
 ---

 Key: YARN-1972
 URL: https://issues.apache.org/jira/browse/YARN-1972
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1972.1.patch


 This work item represents the Java side changes required to implement a 
 secure windows container executor, based on the YARN-1063 changes on 
 native/winutils side. 
 Necessary changes include leveraging the winutils task createas to launch the 
 container process as the required user and a secure localizer (launch 
 localization as a separate process running as the container user).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011523#comment-14011523
 ] 

Sandy Ryza commented on YARN-2026:
--

The nice thing about fair share currently is that it's interpretable as an 
amount of resources that, as long as you stay under, you won't get preempted.   
Changing it to depend on the running apps in the cluster severely complicates 
this.  It used to be that each app and queue's fair share was min'd with its 
resource usage+demand, which is sort of a continuous analog to what you're 
suggesting, but we moved to the current definition when we added multi-resource 
scheduling.

I'm wondering if the right way to solve this problem is to allow preemption to 
be triggered at higher levels in the queue hierarchy.  I.e. suppose we have the 
following situation:
* root has two children - parentA and parentB
* each of root's children has two children - childA1, childA2, childB1, and 
childB2
* the parent queues' minShares are each set to half of the cluster resources
* the child queues' minShares are each set to a quarter of the cluster resources 
* childA1 has a third of the cluster resources
* childB1 and childB2 each have a third of the cluster resources

Even though childA1 is above its fair/minShare, we would see that parentA is 
below its minShare, so we would preempt resources on its behalf.  Once we have 
YARN-596 in, these resources would end up coming from parentB, and end up going 
to childA1.

 Fair scheduler : Fair share for inactive queues causes unfair allocation in 
 some scenarios
 --

 Key: YARN-2026
 URL: https://issues.apache.org/jira/browse/YARN-2026
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Attachments: YARN-2026-v1.txt


 While using hierarchical queues in the fair scheduler, there are a few 
 scenarios where we have seen a leaf queue with the least fair share take the 
 majority of the cluster and starve a sibling parent queue which has a greater 
 weight/fair share, and preemption doesn't kick in to reclaim resources.
 The root cause seems to be that the fair share of a parent queue is 
 distributed to all its children irrespective of whether each is an active or 
 an inactive (no apps running) queue. Preemption based on fair share kicks in 
 only if the usage of a queue is less than 50% of its fair share and if it has 
 demands greater than that. When there are many queues under a parent queue 
 (with a high fair share), the child queues' fair shares become really low. As 
 a result, when only a few of these child queues have apps running, they reach 
 their *tiny* fair share quickly and preemption doesn't happen even if other 
 leaf queues (non-sibling) are hogging the cluster.
 This can be solved by dividing the fair share of a parent queue only among 
 its active child queues.
 Here is an example describing the problem and the proposed solution:
 root.lowPriorityQueue is a leaf queue with weight 2
 root.HighPriorityQueue is a parent queue with weight 8
 root.HighPriorityQueue has 10 child leaf queues : 
 root.HighPriorityQueue.childQ(1..10)
 The above config results in root.HighPriorityQueue having an 80% fair share, 
 and each of its ten child queues would have an 8% fair share. Preemption 
 would happen only if a child queue's usage is below 4% (0.5*8=4). 
 Let's say at the moment no apps are running in any of the 
 root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
 root.lowPriorityQueue, which is taking up 95% of the cluster.
 Up till this point, the behavior of FS is correct.
 Now, let's say root.HighPriorityQueue.childQ1 got a big job which requires 
 30% of the cluster. It would get only the available 5% in the cluster, and 
 preemption wouldn't kick in since it is above 4% (half its fair share). This 
 is bad considering childQ1 is under a high-priority parent queue which has an 
 *80% fair share*.
 Until root.lowPriorityQueue starts relinquishing containers, we would see the 
 following allocation on the scheduler page:
 *root.lowPriorityQueue = 95%*
 *root.HighPriorityQueue.childQ1=5%*
 This can be solved by distributing a parent's fair share only to active 
 queues.
 So in the example above, since childQ1 is the only active queue
 under root.HighPriorityQueue, it would get all of its parent's fair share, 
 i.e. 80%.
 This would cause preemption to reclaim the 30% needed by childQ1 from 
 root.lowPriorityQueue after fairSharePreemptionTimeout seconds.
 Also note that a similar situation can happen between 
 root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2, if childQ2 
 hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be 
 stuck at 5%, until childQ2 starts relinquishing containers. We would like each of 
 

[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user

2014-05-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011524#comment-14011524
 ] 

Vinod Kumar Vavilapalli commented on YARN-1063:
---

Scanned through the patch. It's dense and full of windows related stuff which I 
am not entirely familiar with.

Looked at the code from YARN container localization and launch POV. I have 
posted some comments on YARN-1972 which may cause some changes here too.

 Winutils needs ability to create task as domain user
 

 Key: YARN-1063
 URL: https://issues.apache.org/jira/browse/YARN-1063
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
 Environment: Windows
Reporter: Kyle Leckie
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, 
 YARN-1063.patch


 h1. Summary:
 Securing a Hadoop cluster requires constructing some form of security 
 boundary around the processes executed in YARN containers. Isolation based on 
 Windows user isolation seems most feasible. This approach is similar to the 
 approach taken by the existing LinuxContainerExecutor. The current patch to 
 winutils.exe adds the ability to create a process as a domain user. 
 h1. Alternative Methods considered:
 h2. Process rights limited by security token restriction:
 On Windows access decisions are made by examining the security token of a 
 process. It is possible to spawn a process with a restricted security token. 
 Any of the rights granted by SIDs of the default token may be restricted. It 
 is possible to see this in action by examining the security token of a 
 sandboxed process launched by a web browser. Typically the launched process 
 will have a fully restricted token and need to access machine resources 
 through a dedicated broker process that enforces a custom security policy. 
 This broker process mechanism would break compatibility with the typical 
 Hadoop container process. The Container process must be able to utilize 
 standard function calls for disk and network IO. I performed some work 
 looking at ways to ACL the local files to the specific launched process without 
 granting rights to other processes launched on the same machine but found 
 this to be an overly complex solution. 
 h2. Relying on APP containers:
 Recent versions of windows have the ability to launch processes within an 
 isolated container. Application containers are supported for execution of 
 WinRT based executables. This method was ruled out due to the lack of 
 official support for standard windows APIs. At some point in the future 
 windows may support functionality similar to BSD jails or Linux containers; 
 at that point support for containers should be added.
 h1. Create As User Feature Description:
 h2. Usage:
 A new sub command was added to the set of task commands. Here is the syntax:
 winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE]
 Some notes:
 * The username specified is in the format of user@domain
 * The machine executing this command must be joined to the domain of the user 
 specified
 * The domain controller must allow the account executing the command access 
 to the user information. For this, join the account to the predefined group 
 labeled Pre-Windows 2000 Compatible Access
 * The account running the command must have several rights on the local 
 machine. These can be managed manually using secpol.msc: 
 ** Act as part of the operating system - SE_TCB_NAME
 ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME
 ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME
 * The launched process will not have rights to the desktop so will not be 
 able to display any information or create UI.
 * The launched process will have no network credentials. Any access of 
 network resources that requires domain authentication will fail.
 h2. Implementation:
 Winutils performs the following steps:
 # Enable the required privileges for the current process.
 # Register as a trusted process with the Local Security Authority (LSA).
 # Create a new logon for the user passed on the command line.
 # Load/Create a profile on the local machine for the new logon.
 # Create a new environment for the new logon.
 # Launch the new process in a job with the task name specified and using the 
 created logon.
 # Wait for the JOB to exit.
 h2. Future work:
 The following work was scoped out of this check in:
 * Support for non-domain users or machines that are not domain joined.
 * Support for privilege isolation by running the task launcher in a high 
 privilege service with access over an ACLed named pipe.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores

2014-05-28 Thread Sandy Ryza (JIRA)
Sandy Ryza created YARN-2111:


 Summary: In FairScheduler.attemptScheduling, we won't count 
containers as assigned if they have 0 memory but non-zero cores
 Key: YARN-2111
 URL: https://issues.apache.org/jira/browse/YARN-2111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Sandy Ryza


{code}
if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
  queueMgr.getRootQueue().assignContainer(node),
  Resources.none())) {
{code}

As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores here 
into account.
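One way to make this check resource-type agnostic (a sketch only, not 
necessarily the fix that ends up in the patch):
{code}
// Hypothetical sketch: treat the assignment as non-empty if any dimension
// (memory or vcores) is non-zero, rather than comparing via the memory-only
// DefaultResourceCalculator.
Resource assigned = queueMgr.getRootQueue().assignContainer(node);
if (assigned.getMemory() > 0 || assigned.getVirtualCores() > 0) {
  // count this as an assigned container
}
{code}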



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores

2014-05-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned YARN-2111:


Assignee: Sandy Ryza

 In FairScheduler.attemptScheduling, we won't count containers as assigned if 
 they have 0 memory but non-zero cores
 --

 Key: YARN-2111
 URL: https://issues.apache.org/jira/browse/YARN-2111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza

 {code}
 if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
   queueMgr.getRootQueue().assignContainer(node),
   Resources.none())) {
 {code}
 As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores 
 here into account.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores

2014-05-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-2111:
-

Attachment: YARN-2111.patch

 In FairScheduler.attemptScheduling, we won't count containers as assigned if 
 they have 0 memory but non-zero cores
 --

 Key: YARN-2111
 URL: https://issues.apache.org/jira/browse/YARN-2111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-2111.patch


 {code}
 if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
   queueMgr.getRootQueue().assignContainer(node),
   Resources.none())) {
 {code}
 As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores 
 here into account.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011540#comment-14011540
 ] 

Sandy Ryza commented on YARN-596:
-

+1

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share is just as likely to have containers preempted 
 as an application that is over its fair share.
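 For illustration, the selection described above amounts to roughly the 
 following (a sketch with a hypothetical appsInQueuesOverFairShare collection, 
 not code from the attached patches):
 {code}
 // Hypothetical sketch of the current behavior: collect live containers from
 // apps in over-fair-share queues and sort them by the requested priority,
 // which is what lets an app shield itself by using higher priorities.
 List<RMContainer> candidates = new ArrayList<RMContainer>();
 for (SchedulerApplicationAttempt app : appsInQueuesOverFairShare) {
   candidates.addAll(app.getLiveContainers());
 }
 Collections.sort(candidates, new Comparator<RMContainer>() {
   @Override
   public int compare(RMContainer a, RMContainer b) {
     return a.getContainer().getPriority().compareTo(
         b.getContainer().getPriority());
   }
 });
 {code}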



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011542#comment-14011542
 ] 

Sandy Ryza commented on YARN-596:
-

(pending Jenkins)

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2098) App priority support in Fair Scheduler

2014-05-28 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2098:
--

Attachment: YARN-2098.patch

 App priority support in Fair Scheduler
 --

 Key: YARN-2098
 URL: https://issues.apache.org/jira/browse/YARN-2098
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.5.0
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2098.patch


 This jira is created for supporting app priorities in the fair scheduler. 
 AppSchedulable hard-codes the priority of apps to 1; we should
 change this to get the priority from ApplicationSubmissionContext.
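 A minimal sketch of the intended change (the submissionContext variable is an 
 assumption for illustration, not from the attached patch):
 {code}
 // Hypothetical sketch: read the priority from the submission context instead
 // of hard-coding it to 1.
 Priority priority = submissionContext.getPriority();
 if (priority == null) {
   // keep the old behavior when the client did not supply a priority
   priority = Priority.newInstance(1);
 }
 {code}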



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He reassigned YARN-2110:
-

Assignee: Chen He

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Chen He

 The TestAMRestart#testAMRestartWithExistingContainers does a cast to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as the default scheduler, the test throws 
 {code}java.lang.ClassCastException{code}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011571#comment-14011571
 ] 

Chen He commented on YARN-2110:
---

I will take this. 

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Chen He

 The TestAMRestart#testAMRestartWithExistingContainers does a cast to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as the default scheduler, the test throws 
 {code}java.lang.ClassCastException{code}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler

2014-05-28 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011582#comment-14011582
 ] 

Chen He commented on YARN-2109:
---

This is interesting and I will work on it.

 TestRM fails some tests when some tests run with CapacityScheduler and some 
 with FairScheduler
 --

 Key: YARN-2109
 URL: https://issues.apache.org/jira/browse/YARN-2109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Anubhav Dhoot

 testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in 
 [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set 
 it to be CapacityScheduler. But if the default scheduler is set to 
 FairScheduler, then the rest of the tests that execute after this one will 
 fail with invalid cast exceptions when getting the queue metrics. This depends 
 on test execution order, as only the tests that execute after this test will 
 fail. This is because the queue metrics will be initialized by this test to 
 QueueMetrics and shared by the subsequent tests. 
 We can explicitly clear the metrics at the end of this test to fix this.
 For example
 java.lang.ClassCastException: 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot 
 be cast to 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:90)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:85)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:81)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)
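 A minimal sketch of the cleanup idea (assuming a QueueMetrics.clearQueueMetrics() 
 style helper and JUnit's @After are available; not an attached patch):
 {code}
 // Hypothetical sketch: reset the statically cached scheduler metrics after
 // the CapacityScheduler-specific test so later FairScheduler-based tests
 // start from a clean metrics registry.
 @After
 public void clearSchedulerMetrics() {
   QueueMetrics.clearQueueMetrics();
   DefaultMetricsSystem.shutdown();
 }
 {code}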



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler

2014-05-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He reassigned YARN-2109:
-

Assignee: Chen He

 TestRM fails some tests when some tests run with CapacityScheduler and some 
 with FairScheduler
 --

 Key: YARN-2109
 URL: https://issues.apache.org/jira/browse/YARN-2109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Anubhav Dhoot
Assignee: Chen He

 testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in 
 [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set 
 it to be CapacityScheduler. But if the default scheduler is set to 
 FairScheduler, then the rest of the tests that execute after this one will 
 fail with invalid cast exceptions when getting the queue metrics. This depends 
 on test execution order, as only the tests that execute after this test will 
 fail. This is because the queue metrics will be initialized by this test to 
 QueueMetrics and shared by the subsequent tests. 
 We can explicitly clear the metrics at the end of this test to fix this.
 For example
 java.lang.ClassCastException: 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot 
 be cast to 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:90)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:85)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:81)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1913) With Fair Scheduler, cluster can logjam when all resources are consumed by AMs

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011587#comment-14011587
 ] 

Sandy Ryza commented on YARN-1913:
--

I think we should avoid doing an approximate calculation through the minimum 
allocation.  We need to handle situations where AM resources are much larger 
than the min, and situations where the minimum allocation will be 0 (common on 
Llama-enabled clusters).

This would have the added benefit of avoiding touching the runnability 
machinery, which is already bordering on over-complicated.

 With Fair Scheduler, cluster can logjam when all resources are consumed by AMs
 --

 Key: YARN-1913
 URL: https://issues.apache.org/jira/browse/YARN-1913
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Wei Yan
  Labels: easyfix
 Attachments: YARN-1913.patch, YARN-1913.patch, YARN-1913.patch, 
 YARN-1913.patch


 It's possible to deadlock a cluster by submitting many applications at once, 
 and have all cluster resources taken up by AMs.
 One solution is for the scheduler to limit resources taken up by AMs, as a 
 percentage of total cluster resources, via a maxApplicationMasterShare 
 config.
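 As an illustration only (all names here are assumptions, not the attached 
 patches), such a limit could be enforced with a check along these lines 
 before launching an AM:
 {code}
 // Hypothetical sketch: do not launch a new AM if total AM usage would exceed
 // a configured share of the cluster.
 Resource proposedAmUsage = Resources.add(currentAmUsage, amContainerResource);
 Resource amLimit = Resources.multiply(clusterResource, maxAmShare);
 if (!Resources.fitsIn(proposedAmUsage, amLimit)) {
   return; // leave the application pending until AM usage drops
 }
 {code}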



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1801) NPE in public localizer

2014-05-28 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011601#comment-14011601
 ] 

Jason Lowe commented on YARN-1801:
--

Strictly speaking, the patch does prevent the NPE.  However, the public 
localizer is still effectively doomed if this condition occurs because it 
returns from the run() method.  That will shut down the localizer thread and 
traded an NPE with a traceback for a one-line log message.  I'm not sure this 
is an improvement, since at least the traceback is easier to notice in the NM 
log and we get a corresponding fatal log when someone goes hunting for what 
went wrong with the public localizer.

The real issue is we need to understand what happened to cause 
pending.remove(completed) to return null.  This should never happen, and if it 
does then it means we have a bug.  Trying to recover from this condition is 
patching a symptom rather than a root cause.  The problem that led to the null 
request event _might_ have been fixed by YARN-1575 which wasn't present in 2.2 
where the original bug occurred.  It would be interesting to know if this has 
reoccurred since 2.3.0.

Assuming this is still a potential issue, we should either find a way to 
prevent it from ever occurring or recover in a way that keeps the public 
localizer working as much as possible. It'd be great if we could just pull from 
the queue and receive a structure that has both the request event and the 
FuturePath so we don't have to worry about a FuturePath with no associated 
event.  If we're going to try to recover instead, we'd have to log an error and 
try to clean up.  With no associated request event and no path if we got an 
execution error, it's going to be particularly difficult to recover properly.
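For what it's worth, a minimal sketch of the "pull one structure from the 
queue" idea (a hypothetical pairing class, not an actual patch):
{code}
// Hypothetical sketch: keep the request event and its Future together so a
// completed download can never be missing its originating event.
class PendingPublicDownload {
  final LocalizerResourceRequestEvent event; // the originating request
  final Future<Path> path;                   // the in-flight download

  PendingPublicDownload(LocalizerResourceRequestEvent event,
      Future<Path> path) {
    this.event = event;
    this.path = path;
  }
}
// The localizer thread would then take whole PendingPublicDownload entries
// off a single queue instead of correlating a completed Future with a
// separate pending map.
{code}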

 NPE in public localizer
 ---

 Key: YARN-1801
 URL: https://issues.apache.org/jira/browse/YARN-1801
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Jason Lowe
Assignee: Hong Zhiguo
Priority: Critical
 Attachments: YARN-1801.patch


 While investigating YARN-1800 found this in the NM logs that caused the 
 public localizer to shutdown:
 {noformat}
 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:addResource(651)) - Downloading public 
 rsrc:{ 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
  1390440382009, FILE, null }
 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(726)) - Error: Shutting down
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(728)) - Public cache exiting
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2111) In FairScheduler.attemptScheduling, we won't count containers as assigned if they have 0 memory but non-zero cores

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011625#comment-14011625
 ] 

Hadoop QA commented on YARN-2111:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647204/YARN-2111.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3846//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3846//console

This message is automatically generated.

 In FairScheduler.attemptScheduling, we won't count containers as assigned if 
 they have 0 memory but non-zero cores
 --

 Key: YARN-2111
 URL: https://issues.apache.org/jira/browse/YARN-2111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-2111.patch


 {code}
 if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
   queueMgr.getRootQueue().assignContainer(node),
   Resources.none())) {
 {code}
 As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores 
 here into account.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2111) In FairScheduler.attemptScheduling, we don't count containers as assigned if they have 0 memory but non-zero cores

2014-05-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-2111:
-

Summary: In FairScheduler.attemptScheduling, we don't count containers as 
assigned if they have 0 memory but non-zero cores  (was: In 
FairScheduler.attemptScheduling, we won't count containers as assigned if they 
have 0 memory but non-zero cores)

 In FairScheduler.attemptScheduling, we don't count containers as assigned if 
 they have 0 memory but non-zero cores
 --

 Key: YARN-2111
 URL: https://issues.apache.org/jira/browse/YARN-2111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.4.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-2111.patch


 {code}
 if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
   queueMgr.getRootQueue().assignContainer(node),
   Resources.none())) {
 {code}
 As RESOURCE_CALCULATOR is a DefaultResourceCalculator, we won't take cores 
 here into account.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml

2014-05-28 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2112:
-

 Summary: Hadoop-client is missing jackson libs due to 
inappropriate configs in pom.xml
 Key: YARN-2112
 URL: https://issues.apache.org/jira/browse/YARN-2112
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen


Now YarnClient is using TimelineClient, which has a dependency on jackson 
libs. However, the current dependency configurations make the hadoop-client 
artifact miss 2 jackson libs, such that applications which have a 
hadoop-client dependency will see the following exception
{code}
java.lang.NoClassDefFoundError: 
org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92)
at 
org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88)
at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111)
at 
org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at 
org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
at 
org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at 
org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: 
org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at 

[jira] [Commented] (YARN-2098) App priority support in Fair Scheduler

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011643#comment-14011643
 ] 

Hadoop QA commented on YARN-2098:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647212/YARN-2098.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.
See 
https://builds.apache.org/job/PreCommit-YARN-Build/3847//artifact/trunk/patchprocess/diffJavadocWarnings.txt
 for details.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3847//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3847//console

This message is automatically generated.

 App priority support in Fair Scheduler
 --

 Key: YARN-2098
 URL: https://issues.apache.org/jira/browse/YARN-2098
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.5.0
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2098.patch


 This jira is created for supporting app priorities in fair scheduler. 
 AppSchedulable hard codes priority of apps to 1,we should
 change this to get priority from ApplicationSubmissionContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml

2014-05-28 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2112:
--

Attachment: YARN-2112.1.patch

Created a patch to correct the configs in pom.xml and make sure all 4 jackson 
libs are available in hadoop-client.

 Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
 -

 Key: YARN-2112
 URL: https://issues.apache.org/jira/browse/YARN-2112
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2112.1.patch


 Now YarnClient is using TimelineClient, which has a dependency on jackson 
 libs. However, the current dependency configurations make the hadoop-client 
 artifact miss 2 jackson libs, such that applications which have a 
 hadoop-client dependency will see the following exception
 {code}
 java.lang.NoClassDefFoundError: 
 org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
   at 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
   at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92)
   at 
 org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88)
   at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111)
   at 
 org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
   at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82)
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75)
   at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
   at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:394)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
   at 
 org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
   at 
 org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   at 
 org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
   at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
   at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 Caused by: java.lang.ClassNotFoundException: 
 org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider
   at 

[jira] [Updated] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml

2014-05-28 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2112:
--

Target Version/s: 2.5.0

 Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
 -

 Key: YARN-2112
 URL: https://issues.apache.org/jira/browse/YARN-2112
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2112.1.patch


 Now YarnClient is using TimelineClient, which has a dependency on jackson 
 libs. However, the current dependency configurations make the hadoop-client 
 artifact miss 2 jackson libs, such that applications which have a 
 hadoop-client dependency will see the following exception
 {code}
 java.lang.NoClassDefFoundError: 
 org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
   at 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
   at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92)
   at 
 org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88)
   at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111)
   at 
 org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
   at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82)
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75)
   at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
   at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:394)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
   at 
 org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
   at 
 org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   at 
 org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
   at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
   at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 Caused by: java.lang.ClassNotFoundException: 
 org.codehaus.jackson.jaxrs.JacksonJaxbJsonProvider
   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
   at java.security.AccessController.doPrivileged(Native Method)
   at 

[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-28 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011686#comment-14011686
 ] 

Jian He commented on YARN-2010:
---

I agree that failing to recover an app shouldn't fail the RM. I think for 
cases where the failure will simply be resolved by launching a new attempt, 
like this one, we should not fail the app. We can fail the app for cases where 
starting a new attempt can't resolve the issue, such as failing to renew a DT 
on recovery. 

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) 
 at 
 org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
  
 ... 13 more 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2112) Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011706#comment-14011706
 ] 

Hadoop QA commented on YARN-2112:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647222/YARN-2112.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3848//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3848//console

This message is automatically generated.

 Hadoop-client is missing jackson libs due to inappropriate configs in pom.xml
 -

 Key: YARN-2112
 URL: https://issues.apache.org/jira/browse/YARN-2112
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Attachments: YARN-2112.1.patch


 Now YarnClient is using TimelineClient, which has a dependency on the jackson libs. 
 However, the current dependency configurations make the hadoop-client 
 artifact miss 2 jackson libs, so applications that depend on 
 hadoop-client will see the following exception
 {code}
 java.lang.NoClassDefFoundError: 
 org/codehaus/jackson/jaxrs/JacksonJaxbJsonProvider
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
   at 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
   at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
   at 
 org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.init(TimelineClientImpl.java:92)
   at 
 org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:44)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:149)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.mapred.ResourceMgrDelegate.serviceInit(ResourceMgrDelegate.java:94)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.mapred.ResourceMgrDelegate.init(ResourceMgrDelegate.java:88)
   at org.apache.hadoop.mapred.YARNRunner.init(YARNRunner.java:111)
   at 
 org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
   at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82)
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75)
   at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
   at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:394)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
   at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
   at 

[jira] [Updated] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval

2014-05-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2054:
---

Attachment: yarn-2054-4.patch

Sorry for the bulky patch - forgot to rebase against trunk before generating a 
diff against it :)

Here is the right one. 

 Poor defaults for YARN ZK configs for retries and retry-inteval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, 
 yarn-2054-4.patch


 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1,000 seconds (500 x 2000 ms) before the RM gives up 
 trying to connect to ZK. 
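 The arithmetic behind that figure, as a minimal sketch (not part of any patch here; 
 the property names and defaults are taken from the description, and YarnConfiguration 
 is only used to read them back):
 {code}
 import org.apache.hadoop.yarn.conf.YarnConfiguration;

 public class ZkRetryBudget {
   public static void main(String[] args) {
     YarnConfiguration conf = new YarnConfiguration();
     // Defaults under discussion: 500 retries, 2000 ms apart.
     int numRetries = conf.getInt("yarn.resourcemanager.zk-num-retries", 500);
     int intervalMs = conf.getInt("yarn.resourcemanager.zk-retry-interval-ms", 2000);
     // 500 * 2000 ms = 1,000,000 ms, i.e. about 1,000 seconds before the RM gives up.
     long budgetMs = (long) numRetries * intervalMs;
     System.out.println("Worst-case ZK retry budget: " + (budgetMs / 1000) + " seconds");
   }
 }
 {code}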



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-2110:
--

Attachment: YARN-2110.patch

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Chen He
 Attachments: YARN-2110.patch


 TestAMRestart#testAMRestartWithExistingContainers casts the scheduler to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as the default scheduler, the test throws a 
 {code}java.lang.ClassCastException{code}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011788#comment-14011788
 ] 

Chen He commented on YARN-2110:
---

Changed the cast from CapacityScheduler to AbstractYarnScheduler, which is the 
parent class of both FairScheduler and CapacityScheduler.
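
Not a verbatim excerpt from YARN-2110.patch, just a minimal sketch of the change 
described above (assuming the test keeps using MockRM's getResourceScheduler()):
{code}
// Cast to the common parent type so the test works with either scheduler.
AbstractYarnScheduler scheduler =
    (AbstractYarnScheduler) rm1.getResourceScheduler();
{code}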

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Chen He
 Attachments: YARN-2110.patch


 TestAMRestart#testAMRestartWithExistingContainers casts the scheduler to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as the default scheduler, the test throws a 
 {code}java.lang.ClassCastException{code}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2109) TestRM fails some tests when some tests run with CapacityScheduler and some with FairScheduler

2014-05-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-2109:
--

Labels: test  (was: )

 TestRM fails some tests when some tests run with CapacityScheduler and some 
 with FairScheduler
 --

 Key: YARN-2109
 URL: https://issues.apache.org/jira/browse/YARN-2109
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Anubhav Dhoot
Assignee: Chen He
  Labels: test

 testNMTokenSentForNormalContainer requires CapacityScheduler and was fixed in 
 [YARN-1846|https://issues.apache.org/jira/browse/YARN-1846] to explicitly set 
 the scheduler to CapacityScheduler. But if the default scheduler is set to 
 FairScheduler, then the tests that execute after it will fail with invalid 
 cast exceptions when getting queue metrics. The failures depend on test 
 execution order, because the queue metrics are initialized by this test to 
 QueueMetrics and then shared by the subsequent tests. 
 We can explicitly clear the metrics at the end of this test to fix this 
 (see the sketch after the example below).
 For example
 java.lang.ClassCastException: 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics cannot 
 be cast to 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:103)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:418)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:808)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:230)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:90)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:85)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.MockRM.init(MockRM.java:81)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestRM.testNMToken(TestRM.java:232)
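 One possible shape of that cleanup, as a hedged sketch (the class name is 
 illustrative and it assumes the QueueMetrics.clearQueueMetrics() and 
 DefaultMetricsSystem.shutdown() test helpers; the committed fix may differ):
 {code}
 import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
 import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
 import org.junit.After;

 public class TestRMMetricsCleanup {
   @After
   public void clearSchedulerMetrics() {
     // Drop the statically cached QueueMetrics registered by the
     // CapacityScheduler-specific test so later tests can register FSQueueMetrics.
     QueueMetrics.clearQueueMetrics();
     DefaultMetricsSystem.shutdown();
   }
 }
 {code}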



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-2110:
--

Labels: test  (was: )

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Chen He
  Labels: test
 Attachments: YARN-2110.patch


 TestAMRestart#testAMRestartWithExistingContainers casts the scheduler to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as the default scheduler, the test throws a 
 {code}java.lang.ClassCastException{code}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011807#comment-14011807
 ] 

Ashwin Shankar commented on YARN-596:
-

[~ywskycn], minor comment: can you please update the javadoc comment for 
{code:title=FairScheduler.java}
protected void preemptResources(Resource toPreempt)
{code}
It still describes the previous preemption algorithm.

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.
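 A hedged sketch of the selection rule described above (illustrative fragment, not 
 FairScheduler source; 'containersFromOverShareQueues' is an assumed list of 
 RMContainers gathered from over-fair-share queues):
 {code}
 List<RMContainer> candidates = new ArrayList<RMContainer>(containersFromOverShareQueues);
 Collections.sort(candidates, new Comparator<RMContainer>() {
   @Override
   public int compare(RMContainer a, RMContainer b) {
     // Order solely by the priority the container was requested at;
     // preemption then walks the candidates in this order.
     return a.getContainer().getPriority().compareTo(b.getContainer().getPriority());
   }
 });
 {code}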



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011814#comment-14011814
 ] 

Wei Yan commented on YARN-596:
--

Thanks, [~ashwinshankar77]. I'll update a patch quickly.

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2054) Poor defaults for YARN ZK configs for retries and retry-inteval

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011818#comment-14011818
 ] 

Hadoop QA commented on YARN-2054:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647233/yarn-2054-4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3849//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3849//console

This message is automatically generated.

 Poor defaults for YARN ZK configs for retries and retry-inteval
 ---

 Key: YARN-2054
 URL: https://issues.apache.org/jira/browse/YARN-2054
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
 Attachments: yarn-2054-1.patch, yarn-2054-2.patch, yarn-2054-3.patch, 
 yarn-2054-4.patch


 Currently, we have the following default values:
 # yarn.resourcemanager.zk-num-retries - 500
 # yarn.resourcemanager.zk-retry-interval-ms - 2000
 This leads to a cumulative 1,000 seconds (500 x 2000 ms) before the RM gives up 
 trying to connect to ZK. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)

2014-05-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011820#comment-14011820
 ] 

Vinod Kumar Vavilapalli commented on YARN-1708:
---

Thanks for the patch [~subru]! I started looking at this. Few comments:

h4. Misc
 - I think we should create a ReservationID or ReservationHandle and use it 
instead of strings
 - ReservationResponse.message - errorMessage? Or Errors?

h4. ApplicationClientProtocol
 - createReservation - submitReservation?
 - Let's have separate request/response records for submission, update and 
deletion of reservations. Deletion of a reservation, for example, only needs to 
be supplied a reservationID. See submit/kill app for an analogy. Similarly, 
ReservationRequest.reservationID doesn't need to be part of the request for the 
reservation submission.

h4. ReservationDefinition
 - Seems like there is a notion of absolute time. We should make it clear what 
the arrival/deadline longs really represent, particularly given the 
possibility of different timezones between the RM and the client.
 - It may also be very useful to let users specify time in relative terms: 
6 hrs from now, etc.
 - It lets you specify a list of ResourceRequests. Not sure how we can specify 
things like RR1 for the first 5 mins, RR2 for the next 15, etc.

h4. ReservationDefinitionType
 - It seems like if we instead have a list of records of type (arrival, 
ResourceRequest, deadline), we will cover all the cases in the definition-type 
and then some more? Thoughts?
 - Also, any examples of where R_ANY is useful? Similarly, how is R_ORDER not 
enough, such that we also need R_ORDER_NO_GAP? Focusing mainly on 
use-cases here.

h4. ResourceRequest
 - Is concurrency really a request for a gang of containers?
 - Meaning of leaseDuration? Is it indicating to the scheduler how long the 
container will run?

My suggestions for configuration property renames will follow. We follow a 
component.sub-component.sub-component.property-name convention. (OT: I wish I 
had looked at the preemption-related config names :) ) In any case, I need to see 
the bigger picture with the rest of the patches before I can suggest correct 
naming, so let's drop the YarnConfiguration changes from this patch.

Will look more carefully at the PB impls in the next cycle.

bq. The patch posted here is not submitted, since it depends on many other 
patches part of the umbrella JIRA, the separation is designed only for ease of 
reviewing. 
I see this patch as fairly independent and committable in isolation, though 
we should wait till we have the entire set to make sure the changes here are 
all sufficient and necessary.

 Add a public API to reserve resources (part of YARN-1051)
 -

 Key: YARN-1708
 URL: https://issues.apache.org/jira/browse/YARN-1708
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Carlo Curino
Assignee: Subramaniam Krishnan
 Attachments: YARN-1708.patch


 This JIRA tracks the definition of a new public API for YARN, which allows 
 users to reserve resources (think of time-bounded queues). This is part of 
 the admission control enhancement proposed in YARN-1051.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1474) Make schedulers services

2014-05-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1474:
-

Attachment: YARN-1474.18.patch

Thanks for sharing the opinions, Sandy and Karthik. I also think it's OK to 
change the internal APIs' semantics because the interface is an Evolving one. 
[~vinodkv], please let us know if you have additional comments.

Updated the patch with the following changes to address Karthik's comments:
1. Removed the {{initialized}} flag from *Schedulers. All initialization is done in 
{{serviceInit}} and {{serviceStart}}, instead of {{reinitialize()}}.
2. Changed ResourceSchedulerWrapper to override {{serviceInit}}, 
{{serviceStart}}, {{serviceStop}}.
3. Updated some tests to call scheduler.init() right after 
scheduler.setRMContext(), without a ResourceManager/MockRM (see the sketch below).
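
A minimal sketch of the test-side usage described in item 3 (assumed shape of the 
calls after this patch, not a verbatim excerpt from YARN-1474.18.patch; {{rmContext}} 
and {{conf}} are assumed to be prepared by the test):
{code}
FairScheduler scheduler = new FairScheduler();
scheduler.setRMContext(rmContext);  // set the context first, no ResourceManager/MockRM needed
scheduler.init(conf);               // AbstractService.init -> serviceInit
scheduler.start();                  // AbstractService.start -> serviceStart
// ... exercise the scheduler ...
scheduler.stop();                   // serviceStop
{code}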



 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, 
 YARN-1474.17.patch, YARN-1474.18.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
 YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
 YARN-1474.8.patch, YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN

2014-05-28 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011844#comment-14011844
 ] 

Nishkam Ravi commented on YARN-2041:


Unlike FIFO, whose performance deteriorates consistently across multiple 
benchmarks as the value of yarn.nodemanager.resource.memory-mb is increased from 
16GB to 40GB, the Capacity scheduler performs well for all benchmarks except 
TeraValidate. 

For TeraValidate in single-job mode:

Exec. time with Fair: 38 sec (yarn.nodemanager.resource.memory-mb = 16GB)
Exec. time with Fair: 38 sec (yarn.nodemanager.resource.memory-mb = 40GB)
Exec. time with Capacity: 51 sec (yarn.nodemanager.resource.memory-mb = 16GB)
Exec. time with Capacity: 100 sec (yarn.nodemanager.resource.memory-mb = 40GB)

Also, in multi-job mode, Capacity seems to be behaving like FIFO, scheduling 
one job at a time for execution. 


 Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
 

 Key: YARN-2041
 URL: https://issues.apache.org/jira/browse/YARN-2041
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Nishkam Ravi

 Performance of MR2 jobs falls drastically as YARN config parameter 
 yarn.nodemanager.resource.memory-mb  is increased beyond a certain value. 
 Performance of Spark falls drastically as the value of 
 yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a 
 large data set.
 This makes it hard to co-locate MR2 and Spark jobs in YARN.
 The experiments are being conducted on a 6-node cluster. The following 
 workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, 
 ShuffleText and PageRank.
 Will add more details to this JIRA over time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-05-28 Thread Ashwin Shankar (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin Shankar updated YARN-2026:
-

Description: 
Problem 1 - While using hierarchical queues in the fair scheduler, there are a few 
scenarios where we have seen that a leaf queue with the least fair share can take 
the majority of the cluster and starve a sibling parent queue which has a greater 
weight/fair share, and preemption doesn't kick in to reclaim resources.

The root cause seems to be that the fair share of a parent queue is distributed to 
all its children irrespective of whether each child is an active or an inactive (no apps 
running) queue. Preemption based on fair share kicks in only if the usage of a 
queue is less than 50% of its fair share and if it has demands greater than 
that. When there are many queues under a parent queue (with a high fair share), each 
child queue's fair share becomes really low. As a result, when only a few of these 
child queues have apps running, they reach their *tiny* fair share quickly and 
preemption doesn't happen even if other leaf queues (non-siblings) are hogging 
the cluster.

This can be solved by dividing the fair share of a parent queue only among its active 
child queues.

Here is an example describing the problem and the proposed solution:
root.lowPriorityQueue is a leaf queue with weight 2
root.HighPriorityQueue is a parent queue with weight 8
root.HighPriorityQueue has 10 child leaf queues: 
root.HighPriorityQueue.childQ(1..10)

The above config results in root.HighPriorityQueue having an 80% fair share,
and each of its ten child queues would have an 8% fair share. Preemption would 
happen only if a child queue's usage is below 4% (0.5*8=4). 

Let's say at the moment no apps are running in any of 
root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
root.lowPriorityQueue, which is taking up 95% of the cluster.
Up to this point, the behavior of FS is correct.

Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% of 
the cluster. It would get only the available 5% of the cluster, and preemption 
wouldn't kick in since it's above 4% (half its fair share). This is bad considering 
childQ1 is under a high-priority parent queue which has *80% fair share*.

Until root.lowPriorityQueue starts relinquishing containers, we would see the 
following allocation on the scheduler page:
*root.lowPriorityQueue = 95%*
*root.HighPriorityQueue.childQ1 = 5%*

This can be solved by distributing a parent's fair share only to active queues.

So in the example above, since childQ1 is the only active queue
under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 80%.
This would cause preemption to reclaim the 30% needed by childQ1 from 
root.lowPriorityQueue after fairSharePreemptionTimeout seconds.

Problem 2 - Also note that a similar situation can happen between 
root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2, if childQ2 
hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be stuck at 
5%, until childQ2 starts relinquishing containers. We would like each of childQ1 
and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. 40%, which 
would ensure childQ1 gets up to 40% of the resources if needed, through preemption.

  was:
While using hierarchical queues in fair scheduler,there are few scenarios where 
we have seen a leaf queue with least fair share can take majority of the 
cluster and starve a sibling parent queue which has greater weight/fair share 
and preemption doesn’t kick in to reclaim resources.

The root cause seems to be that fair share of a parent queue is distributed to 
all its children irrespective of whether its an active or an inactive(no apps 
running) queue. Preemption based on fair share kicks in only if the usage of a 
queue is less than 50% of its fair share and if it has demands greater than 
that. When there are many queues under a parent queue(with high fair share),the 
child queue’s fair share becomes really low. As a result when only few of these 
child queues have apps running,they reach their *tiny* fair share quickly and 
preemption doesn’t happen even if other leaf queues(non-sibling) are hogging 
the cluster.

This can be solved by dividing fair share of parent queue only to active child 
queues.

Here is an example describing the problem and proposed solution:
root.lowPriorityQueue is a leaf queue with weight 2
root.HighPriorityQueue is parent queue with weight 8
root.HighPriorityQueue has 10 child leaf queues : 
root.HighPriorityQueue.childQ(1..10)

Above config,results in root.HighPriorityQueue having 80% fair share
and each of its ten child queue would have 8% fair share. Preemption would 
happen only if the child queue is 4% (0.5*8=4). 

Lets say at the moment no apps are running in any of the 
root.HighPriorityQueue.childQ(1..10) and few apps are running in 
root.lowPriorityQueue which is taking up 95% of the cluster.
Up till this point,the behavior of FS is correct.

Now,lets say 

[jira] [Commented] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011847#comment-14011847
 ] 

Hadoop QA commented on YARN-2110:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647243/YARN-2110.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3850//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3850//console

This message is automatically generated.

 TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
 ---

 Key: YARN-2110
 URL: https://issues.apache.org/jira/browse/YARN-2110
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Anubhav Dhoot
Assignee: Chen He
  Labels: test
 Attachments: YARN-2110.patch


 TestAMRestart#testAMRestartWithExistingContainers casts the scheduler to 
 CapacityScheduler in a couple of places
 {code}
 ((CapacityScheduler) rm1.getResourceScheduler())
 {code}
 If run with FairScheduler as the default scheduler, the test throws a 
 {code}java.lang.ClassCastException{code}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-596:
-

Attachment: YARN-596.patch

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1474) Make schedulers services

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011879#comment-14011879
 ] 

Hadoop QA commented on YARN-1474:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647249/YARN-1474.18.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 9 
warning messages.
See 
https://builds.apache.org/job/PreCommit-YARN-Build/3851//artifact/trunk/patchprocess/diffJavadocWarnings.txt
 for details.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-sls 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3851//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3851//console

This message is automatically generated.

 Make schedulers services
 

 Key: YARN-1474
 URL: https://issues.apache.org/jira/browse/YARN-1474
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: scheduler
Affects Versions: 2.3.0, 2.4.0
Reporter: Sandy Ryza
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1474.1.patch, YARN-1474.10.patch, 
 YARN-1474.11.patch, YARN-1474.12.patch, YARN-1474.13.patch, 
 YARN-1474.14.patch, YARN-1474.15.patch, YARN-1474.16.patch, 
 YARN-1474.17.patch, YARN-1474.18.patch, YARN-1474.2.patch, YARN-1474.3.patch, 
 YARN-1474.4.patch, YARN-1474.5.patch, YARN-1474.6.patch, YARN-1474.7.patch, 
 YARN-1474.8.patch, YARN-1474.9.patch


 Schedulers currently have a reinitialize but no start and stop.  Fitting them 
 into the YARN service model would make things more coherent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters

2014-05-28 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2091:
-

Attachment: YARN-2091.1.patch

Added ContainerExitStatus.KILL_EXCEEDED_MEMORY and a test that the exit status 
is passed from the NM to the RM correctly.

 Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
 ---

 Key: YARN-2091
 URL: https://issues.apache.org/jira/browse/YARN-2091
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2091.1.patch


 Currently, the AM cannot programmatically determine if the task was killed 
 due to using excessive memory. The NM kills it without passing this 
 information in the container status back to the RM. So the AM cannot take any 
 action here. The jira tracks adding this exit status and passing it from the 
 NM to the RM and then the AM. In general, there may be other such actions 
 taken by YARN that are currently opaque to the AM. 
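 A hedged sketch of how an AM could consume the proposed status once it is plumbed 
 through ({{allocateResponse}} is an assumed allocate-call result, and 
 KILL_EXCEEDED_MEMORY is the constant proposed in this JIRA, not an existing field):
 {code}
 for (ContainerStatus status : allocateResponse.getCompletedContainersStatuses()) {
   if (status.getExitStatus() == ContainerExitStatus.KILL_EXCEEDED_MEMORY) {
     // e.g. retry the task with a bigger container instead of counting a generic failure
   }
 }
 {code}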



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-28 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2010:
---

Attachment: yarn-2010-3.patch

New patch that gets rid of the config and addresses the issue where the 
masterKey is null. 
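
A minimal sketch of the null-key guard implied above (assumed shape and variable 
names, not a verbatim excerpt from yarn-2010-3.patch):
{code}
// Attempts stored before security was enabled may have no client-to-AM master key.
// Skip registration instead of letting SecretKeySpec fail with "Missing argument".
if (clientTokenMasterKeyBytes != null && clientTokenMasterKeyBytes.length > 0) {
  rmContext.getClientToAMTokenSecretManager()
      .registerMasterKey(applicationAttemptId, clientTokenMasterKeyBytes);
}
{code}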

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
 yarn-2010-3.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) 
 at 
 org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
  
 ... 13 more 
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011906#comment-14011906
 ] 

Hadoop QA commented on YARN-596:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647256/YARN-596.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3852//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3852//console

This message is automatically generated.

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted in order of the priority that the container was requested 
 in.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios

2014-05-28 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011907#comment-14011907
 ] 

Ashwin Shankar commented on YARN-2026:
--

Hi [~sandyr],
bq. We would see that parentA is below its minShare, so we would preempt 
resources on its behalf. 
minShare preemption at the parent-queue level is not yet implemented; 
FairScheduler.resToPreempt() is not recursive (YARN-596 doesn't address this).
I had created YARN-1961 for this purpose, which I plan to work on.
But yes, you are right: if YARN-1961 is in place, we can set minShare and 
minShareTimeout at parentA, which would 
reclaim resources from parentB.

This solves problem 1 in the description, but what about problem 2?
Consider many leaf queues under a parent, say using the NestedUserQueue rule.
E.g.
 - parentA has 100 user queues under it
 - the fair share of each user queue is 1% of parentA (assuming weight=1)
 - Say user queue parentA.user1 is taking up 100% of the cluster since it's the only 
active queue.
 - parentA.user2, which was inactive till now, submits a job and needs, say, 20%.
 - parentA.user2 would get only 1% through preemption and parentA.user1 would 
keep 99%.
  This seems unfair considering the users have equal weight. Eventually, as user1 
releases its containers, 
  they would go to user2, but until that happens user1 can hog the cluster.

In our cluster we have about 200 users (so 200 user queues), but only about 
20% (on average) are active 
at a point in time. The fair share for each user becomes really low, (1/200)*parent, 
and can cause 
the 'unfairness' mentioned in the above example.
This can be solved by dividing the fair share only among active queues (see the 
sketch below).

How about this: can we have a new property, say 'fairShareForActiveQueues', which 
turns this feature on/off? That way, people 
who need it can use it, and others can turn it off and get the usual 
static fair-share behavior.
Thoughts ?
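
A self-contained sketch of the "active queues only" division discussed above 
(illustrative names and structure, not FairScheduler internals):
{code}
// Divide a parent's fair share among active children only; inactive children get 0.
static double activeChildFairShare(double parentFairShare, boolean[] childIsActive) {
  int active = 0;
  for (boolean isActive : childIsActive) {
    if (isActive) {
      active++;
    }
  }
  return active == 0 ? 0.0 : parentFairShare / active;
}
{code}
With parentA's share split this way, a single active user queue would be entitled to 
the whole of parentA's share rather than 1% of it, so preemption could reclaim up to 
that amount on its behalf.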

 Fair scheduler : Fair share for inactive queues causes unfair allocation in 
 some scenarios
 --

 Key: YARN-2026
 URL: https://issues.apache.org/jira/browse/YARN-2026
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
  Labels: scheduler
 Attachments: YARN-2026-v1.txt


 Problem 1 - While using hierarchical queues in the fair scheduler, there are a few 
 scenarios where we have seen that a leaf queue with the least fair share can take 
 the majority of the cluster and starve a sibling parent queue which has a greater 
 weight/fair share, and preemption doesn't kick in to reclaim resources.
 The root cause seems to be that the fair share of a parent queue is distributed 
 to all its children irrespective of whether each child is an active or an inactive (no 
 apps running) queue. Preemption based on fair share kicks in only if the 
 usage of a queue is less than 50% of its fair share and if it has demands 
 greater than that. When there are many queues under a parent queue (with a high 
 fair share), each child queue's fair share becomes really low. As a result, when 
 only a few of these child queues have apps running, they reach their *tiny* fair 
 share quickly and preemption doesn't happen even if other leaf 
 queues (non-siblings) are hogging the cluster.
 This can be solved by dividing the fair share of a parent queue only among its active 
 child queues.
 Here is an example describing the problem and the proposed solution:
 root.lowPriorityQueue is a leaf queue with weight 2
 root.HighPriorityQueue is a parent queue with weight 8
 root.HighPriorityQueue has 10 child leaf queues: 
 root.HighPriorityQueue.childQ(1..10)
 The above config results in root.HighPriorityQueue having an 80% fair share,
 and each of its ten child queues would have an 8% fair share. Preemption would 
 happen only if a child queue's usage is below 4% (0.5*8=4). 
 Let's say at the moment no apps are running in any of 
 root.HighPriorityQueue.childQ(1..10) and a few apps are running in 
 root.lowPriorityQueue, which is taking up 95% of the cluster.
 Up to this point, the behavior of FS is correct.
 Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% 
 of the cluster. It would get only the available 5% of the cluster, and 
 preemption wouldn't kick in since it's above 4% (half its fair share). This is bad 
 considering childQ1 is under a high-priority parent queue which has *80% fair 
 share*.
 Until root.lowPriorityQueue starts relinquishing containers, we would see the 
 following allocation on the scheduler page:
 *root.lowPriorityQueue = 95%*
 *root.HighPriorityQueue.childQ1 = 5%*
 This can be solved by distributing a parent's fair share only to active 
 queues.
 So in the example above, since childQ1 is the only active queue
 under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. 
 80%.
 This would cause preemption 

[jira] [Commented] (YARN-2091) Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011918#comment-14011918
 ] 

Hadoop QA commented on YARN-2091:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647261/YARN-2091.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3853//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3853//console

This message is automatically generated.

 Add ContainerExitStatus.KILL_EXCEEDED_MEMORY and pass it to app masters
 ---

 Key: YARN-2091
 URL: https://issues.apache.org/jira/browse/YARN-2091
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Bikas Saha
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2091.1.patch


 Currently, the AM cannot programmatically determine if the task was killed 
 due to using excessive memory. The NM kills it without passing this 
 information in the container status back to the RM. So the AM cannot take any 
 action here. The jira tracks adding this exit status and passing it from the 
 NM to the RM and then the AM. In general, there may be other such actions 
 taken by YARN that are currently opaque to the AM. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

2014-05-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011943#comment-14011943
 ] 

Hadoop QA commented on YARN-2010:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12647268/yarn-2010-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3854//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3854//console

This message is automatically generated.

 RM can't transition to active if it can't recover an app attempt
 

 Key: YARN-2010
 URL: https://issues.apache.org/jira/browse/YARN-2010
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.3.0
Reporter: bc Wong
Assignee: Rohith
Priority: Critical
 Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, 
 yarn-2010-3.patch


 If the RM fails to recover an app attempt, it won't come up. We should make 
 it more resilient.
 Specifically, the underlying error is that the app was submitted before 
 Kerberos security got turned on. Makes sense for the app to fail in this 
 case. But YARN should still start.
 {noformat}
 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election 
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to 
 Active 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
  
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
  
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) 
 at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
 transitioning to Active mode 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
  
 ... 4 more 
 Caused by: org.apache.hadoop.service.ServiceStateException: 
 org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
  
 ... 5 more 
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 java.lang.IllegalArgumentException: Missing argument 
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
  
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
  
 at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) 
 ... 8 more 
 Caused by: java.lang.IllegalArgumentException: Missing argument 
 at 

[jira] [Commented] (YARN-2041) Hard to co-locate MR2 and Spark jobs on the same cluster in YARN

2014-05-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011982#comment-14011982
 ] 

Vinod Kumar Vavilapalli commented on YARN-2041:
---

Tx for all the updates, [~nravi], but can you please make clear the issues that 
you think need to be fixed?

 Hard to co-locate MR2 and Spark jobs on the same cluster in YARN
 

 Key: YARN-2041
 URL: https://issues.apache.org/jira/browse/YARN-2041
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Nishkam Ravi

 Performance of MR2 jobs falls drastically as YARN config parameter 
 yarn.nodemanager.resource.memory-mb  is increased beyond a certain value. 
 Performance of Spark falls drastically as the value of 
 yarn.nodemanager.resource.memory-mb is decreased beyond a certain value for a 
 large data set.
 This makes it hard to co-locate MR2 and Spark jobs in YARN.
 The experiments are being conducted on a 6-node cluster. The following 
 workloads are being run: TeraGen, TeraSort, TeraValidate, WordCount, 
 ShuffleText and PageRank.
 Will add more details to this JIRA over time.
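 For reference, yarn.nodemanager.resource.memory-mb is normally set per NodeManager in yarn-site.xml; the snippet below only shows the same key being set programmatically through the standard Configuration API, with a purely illustrative value, to make the knob being discussed concrete:
 {code:java}
import org.apache.hadoop.conf.Configuration;

public class NmMemoryConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Illustrative value only: per the report above, large values hurt MR2
    // jobs while small values hurt Spark jobs on large data sets.
    conf.setInt("yarn.nodemanager.resource.memory-mb", 24576);
    System.out.println(conf.get("yarn.nodemanager.resource.memory-mb"));
  }
}
 {code}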



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) In fair scheduler, intra-application container priorities affect inter-application preemption decisions

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012030#comment-14012030
 ] 

Sandy Ryza commented on YARN-596:
-

+1

 In fair scheduler, intra-application container priorities affect 
 inter-application preemption decisions
 ---

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted by the priority at which each container was requested.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.
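 A much-simplified sketch of the selection order described above (not the actual FairScheduler code; the Candidate type and its fields are invented): one flat list across all over-share queues, ordered only by the priority the containers were requested at:
 {code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hedged illustration of the behavior described above; this is not
// FairScheduler code, and the Candidate type is invented for this sketch.
public class PreemptionOrderSketch {

  static class Candidate {
    final String app;
    final int requestPriority; // in YARN, a lower number means higher priority

    Candidate(String app, int requestPriority) {
      this.app = app;
      this.requestPriority = requestPriority;
    }
  }

  public static void main(String[] args) {
    List<Candidate> candidates = new ArrayList<>();
    candidates.add(new Candidate("appA", 20)); // requested at low priority
    candidates.add(new Candidate("appB", 1));  // requested at high priority

    // One flat list, ordered only by request priority: appA's containers are
    // always taken before appB's, no matter how far over its share either app
    // (or its queue) actually is.
    candidates.sort(
        Comparator.comparingInt((Candidate c) -> c.requestPriority).reversed());

    candidates.forEach(c -> System.out.println("preempt from " + c.app));
  }
}
 {code}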



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-596) Use scheduling policies throughout the hierarchy to decide which containers to preempt

2014-05-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-596:


Summary: Use scheduling policies throughout the hierarchy to decide which 
containers to preempt  (was: In fair scheduler, intra-application container 
priorities affect inter-application preemption decisions)

 Use scheduling policies throughout the hierarchy to decide which containers 
 to preempt
 --

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted by the priority at which each container was requested.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt

2014-05-28 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-596:


Summary: Use scheduling policies throughout the queue hierarchy to decide 
which containers to preempt  (was: Use scheduling policies throughout the 
hierarchy to decide which containers to preempt)

 Use scheduling policies throughout the queue hierarchy to decide which 
 containers to preempt
 

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted by the priority at which each container was requested.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012038#comment-14012038
 ] 

Sandy Ryza commented on YARN-596:
-

I just committed this to trunk and branch-2.  Thanks Wei for the patch and 
Ashwin for taking a look.

 Use scheduling policies throughout the queue hierarchy to decide which 
 containers to preempt
 

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.5.0

 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted by the priority at which each container was requested.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-596) Use scheduling policies throughout the queue hierarchy to decide which containers to preempt

2014-05-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012052#comment-14012052
 ] 

Hudson commented on YARN-596:
-

SUCCESS: Integrated in Hadoop-trunk-Commit #5619 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5619/])
YARN-596. Use scheduling policies throughout the queue hierarchy to decide 
which containers to preempt (Wei Yan via Sandy Ryza) (sandy: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1598197)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSSchedulerApp.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java
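Judging from the issue summary and the touched files (FSParentQueue, FSLeafQueue, Schedulable, SchedulingPolicy and the per-policy comparators), the change points toward letting each level of the queue hierarchy choose its preemption victim via its own policy's ordering rather than a flat priority-sorted container list. A hedged sketch of that shape, with invented names and no claim to match the committed code:
{code:java}
import java.util.Comparator;
import java.util.List;

// Hedged sketch only: the class and method names loosely echo the files in the
// commit, but this is an invented simplification, not the committed code.
abstract class SchedulableSketch {
  abstract String preemptContainer();
}

class AppSketch extends SchedulableSketch {
  @Override
  String preemptContainer() {
    return "a running container of this app"; // placeholder
  }
}

class QueueSketch extends SchedulableSketch {
  private final List<SchedulableSketch> children; // assumed non-empty
  // Each queue applies its own policy's ordering (fair share, DRF, FIFO, ...).
  private final Comparator<SchedulableSketch> policyComparator;

  QueueSketch(List<SchedulableSketch> children,
              Comparator<SchedulableSketch> policyComparator) {
    this.children = children;
    this.policyComparator = policyComparator;
  }

  @Override
  String preemptContainer() {
    // Ask the policy which child is most over its share, then recurse into it,
    // instead of sorting one global list of containers by request priority.
    SchedulableSketch victim = children.stream().max(policyComparator).get();
    return victim.preemptContainer();
  }
}
{code}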


 Use scheduling policies throughout the queue hierarchy to decide which 
 containers to preempt
 

 Key: YARN-596
 URL: https://issues.apache.org/jira/browse/YARN-596
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.0.3-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.5.0

 Attachments: YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, 
 YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch, YARN-596.patch


 In the fair scheduler, containers are chosen for preemption in the following 
 way:
 All containers for all apps that are in queues that are over their fair share 
 are put in a list.
 The list is sorted by the priority at which each container was requested.
 This means that an application can shield itself from preemption by 
 requesting its containers at higher priorities, which doesn't really make 
 sense.
 Also, an application that is not over its fair share, but that is in a queue 
 that is over its fair share, is just as likely to have containers preempted 
 as an application that is over its fair share.



--
This message was sent by Atlassian JIRA
(v6.2#6252)