[jira] [Updated] (YARN-1948) Expose utility methods in Apps.java publically

2015-06-16 Thread nijel (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nijel updated YARN-1948:

Attachment: YARN-1948-1.patch

Attached the patch with the modification.
Please review.

 Expose utility methods in Apps.java publically
 --

 Key: YARN-1948
 URL: https://issues.apache.org/jira/browse/YARN-1948
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Affects Versions: 2.4.0
Reporter: Sandy Ryza
Assignee: nijel
  Labels: newbie
 Attachments: YARN-1948-1.patch


 Apps.setEnvFromInputString and Apps.addToEnvironment are methods used by 
 MapReduce, Spark, and Tez that are currently marked private.  As these are 
 useful for any YARN app that wants to allow users to augment container 
 environments, it would be helpful to make them public.
 It may make sense to put them in a new class with a better name.
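 
 A minimal sketch (not part of this issue, and assuming the current four-argument
 signatures of these helpers) of how a YARN application might use them to build a
 container launch environment once they are public:
 {code}
 import java.util.HashMap;
 import java.util.Map;
 
 import org.apache.hadoop.yarn.api.ApplicationConstants;
 import org.apache.hadoop.yarn.util.Apps;
 
 public class ContainerEnvExample {
   public static void main(String[] args) {
     Map<String, String> env = new HashMap<String, String>();
     // Append an entry to the container CLASSPATH.
     Apps.addToEnvironment(env, ApplicationConstants.Environment.CLASSPATH.name(),
         "./my-app/*", ApplicationConstants.CLASS_PATH_SEPARATOR);
     // Parse a user-supplied "K1=V1,K2=V2" string into the environment map.
     Apps.setEnvFromInputString(env, "JAVA_HOME=/usr/java/latest,LOG_LEVEL=INFO",
         ApplicationConstants.CLASS_PATH_SEPARATOR);
     System.out.println(env);
   }
 }
 {code}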



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587975#comment-14587975
 ] 

Hudson commented on YARN-3711:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2158 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2158/])
YARN-3711. Documentation of ResourceManager HA should explain configurations 
about listen addresses. Contributed by Masatake Iwasaki. (ozawa: rev 
e8c514373f2d258663497a33ffb3b231d0743b57)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerHA.md


 Documentation of ResourceManager HA should explain configurations about 
 listen addresses
 

 Key: YARN-3711
 URL: https://issues.apache.org/jira/browse/YARN-3711
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Fix For: 2.7.1

 Attachments: YARN-3711.002.patch, YARN-3711.003.patch


 There should be explanation about webapp address in addition to RPC address.
 AM proxy filter needs explicit definition of 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses 
 in RM-HA mode.
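 
 An illustrative snippet of the kind of per-RM-id webapp settings the documentation
 should mention (hostnames and ports below are placeholders, not from this issue):
 {code:xml}
 <property>
   <name>yarn.resourcemanager.webapp.address.rm1</name>
   <value>rm1.example.com:8088</value>
 </property>
 <property>
   <name>yarn.resourcemanager.webapp.address.rm2</name>
   <value>rm2.example.com:8088</value>
 </property>
 {code}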



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Improve logs for LeafQueue#activateApplications()

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587972#comment-14587972
 ] 

Hudson commented on YARN-3789:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2158 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2158/])
YARN-3789. Improve logs for LeafQueue#activateApplications(). Contributed 
(devaraj: rev b039e69bb03accef485361af301fa59f03d08d6a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt


 Improve logs for LeafQueue#activateApplications() 
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Fix For: 2.8.0

 Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch


 Duplicate logging from resource manager
 during am limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588025#comment-14588025
 ] 

Hadoop QA commented on YARN-3804:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 53s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 47s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 59s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 53s | The applied patch generated  2 
new checkstyle issues (total was 2, now 4). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 34s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests |   1m 58s | Tests failed in 
hadoop-yarn-common. |
| | |  41m 38s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.security.TestYARNTokenIdentifier |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12739849/YARN-3804.01.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8260/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8260/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8260/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8260/console |


This message was automatically generated.

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized user  PERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 

[jira] [Commented] (YARN-3771) final behavior is not honored for YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH since it is a String[]

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588032#comment-14588032
 ] 

Hadoop QA commented on YARN-3771:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  18m 32s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 38s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m  8s | The applied patch generated  4 
new checkstyle issues (total was 213, now 203). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 17s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   0m 47s | Tests passed in 
hadoop-mapreduce-client-common. |
| {color:red}-1{color} | mapreduce tests | 109m 49s | Tests failed in 
hadoop-mapreduce-client-jobclient. |
| {color:green}+1{color} | yarn tests |   0m 36s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   7m  3s | Tests passed in 
hadoop-yarn-applications-distributedshell. |
| | | 163m  8s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.mapred.TestReduceFetch |
|   | hadoop.mapred.TestReduceFetchFromPartialMem |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12737924/0001-YARN-3771.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8259/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-mapreduce-client-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8259/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt
 |
| hadoop-mapreduce-client-jobclient test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8259/artifact/patchprocess/testrun_hadoop-mapreduce-client-jobclient.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8259/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-applications-distributedshell test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8259/artifact/patchprocess/testrun_hadoop-yarn-applications-distributedshell.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8259/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8259/console |


This message was automatically generated.

 final behavior is not honored for 
 YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH  since it is a String[]
 

 Key: YARN-3771
 URL: https://issues.apache.org/jira/browse/YARN-3771
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Attachments: 0001-YARN-3771.patch


 I was going through some FindBugs rules. One issue reported there is that
  public static final String[] DEFAULT_YARN_APPLICATION_CLASSPATH = {
 and
   public static final String[] DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH =
 do not honor the final qualifier: the string array contents can be reassigned!
 Simple test
 {code}
 public class TestClass {
   static final String[] t = { "1", "2" };
 
   public static void main(String[] args) {
     System.out.println(12 > 10);
     String[] t1 = { "u" };
     // t = t1;      // this will show a compilation error
     t[0] = t1[0];   // but this works
   }
 }
 {code}
 One option is to use Collections.unmodifiableList.
 Any thoughts?
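 
 A sketch of that option (the classpath entries below are illustrative, not the real
 default list): an unmodifiable List keeps the constant's contents from being changed.
 {code}
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.List;
 
 public class ImmutableClasspathExample {
   public static final List<String> DEFAULT_YARN_APPLICATION_CLASSPATH =
       Collections.unmodifiableList(Arrays.asList(
           "$HADOOP_CONF_DIR",
           "$HADOOP_COMMON_HOME/share/hadoop/common/*"));
 
   public static void main(String[] args) {
     // Throws UnsupportedOperationException instead of silently mutating the "constant".
     DEFAULT_YARN_APPLICATION_CLASSPATH.set(0, "overwritten");
   }
 }
 {code}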



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3804:
---
Attachment: YARN-3804.01.patch

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized user  PERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt by the RM to switch to Active, refreshAdminAcls is called, but the 
 user does not have the required ACL permission.
 The switch to Active is retried indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user it retries refreshAdminAcls as can never change at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587943#comment-14587943
 ] 

Varun Saxena commented on YARN-3804:


[~bibinchundatt], if yarn is the user the daemon is started with, this case is also 
handled by the submitted patch. If it is not the same as the daemon user, you will 
have to configure it.

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized user  PERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt by the RM to switch to Active, refreshAdminAcls is called, but the 
 user does not have the required ACL permission.
 The switch to Active is retried indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user it retries refreshAdminAcls as can never change at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3785) Support for Resource as an argument during submitApp call in MockRM test class

2015-06-16 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588041#comment-14588041
 ] 

Sunil G commented on YARN-3785:
---

Thank you [~xgong] for reviewing and committing the patch!

 Support for Resource as an argument during submitApp call in MockRM test class
 --

 Key: YARN-3785
 URL: https://issues.apache.org/jira/browse/YARN-3785
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Sunil G
Assignee: Sunil G
Priority: Minor
 Fix For: 2.8.0

 Attachments: 0001-YARN-3785.patch, 0002-YARN-3785.patch


 Currently MockRM#submitApp supports only memory. This adds test cases to support 
 vcores so that DominantResourceCalculator can be tested with it.
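 
 A hedged sketch of how the new overload might be exercised in a test (the exact
 MockRM signature comes from the committed patch and may differ):
 {code}
 import org.apache.hadoop.yarn.api.records.Resource;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
 import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
 
 public class SubmitAppWithVcoresExample {
   public static void main(String[] args) throws Exception {
     MockRM rm = new MockRM(new YarnConfiguration());
     rm.start();
     // 2048 MB of memory and 4 vcores, so DominantResourceCalculator paths get exercised.
     RMApp app = rm.submitApp(Resource.newInstance(2048, 4));
     rm.stop();
   }
 }
 {code}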



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587931#comment-14587931
 ] 

Bibin A Chundatt commented on YARN-3804:


[~varun_saxena], [~vinodkv] Also, at startup {{getServiceState}} will throw an 
exception:

{code}
2015-06-16 16:48:38,246 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
IP=10.19.92.128 OPERATION=getServiceState   TARGET=AdminService 
RESULT=FAILURE  DESCRIPTION=Unauthorized user   PERMISSIONS=
2015-06-16 16:48:38,247 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 45021, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 
10.19.92.128:53773 Call#238 Retry#0
org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
permission to call 'getServiceState'
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.getServiceStatus(AdminService.java:344)
at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.getServiceStatus(HAServiceProtocolServerSideTranslatorPB.java:131)
at 
org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4464)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:972)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2088)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2084)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082)
2015-06-16 16:48:38,258 WARN org.apache.hadoop.security.UserGroupInformation: 
No groups available for user yarn

{code}

Should we handle this too?

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized user  PERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 

[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587932#comment-14587932
 ] 

Varun Saxena commented on YARN-3804:


Added the user the daemon starts with to the list of admin ACLs, so that it 
matches.
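
A minimal sketch of that idea (not necessarily the committed change): build the admin
ACL from yarn.admin.acl and then always add the daemon user.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AdminAclSketch {
  public static AccessControlList buildAdminAcl(Configuration conf) throws Exception {
    AccessControlList acl = new AccessControlList(
        conf.get(YarnConfiguration.YARN_ADMIN_ACL, YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    // The user the RM daemon runs as is always allowed, even if yarn.admin.acl omits it.
    acl.addUser(UserGroupInformation.getCurrentUser().getShortUserName());
    return acl;
  }
}
{code}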

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized user  PERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt by the RM to switch to Active, refreshAdminAcls is called, but the 
 user does not have the required ACL permission.
 The switch to Active is retried indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user it retries refreshAdminAcls as can never change at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Improve logs for LeafQueue#activateApplications()

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588124#comment-14588124
 ] 

Hudson commented on YARN-3789:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #219 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/219/])
YARN-3789. Improve logs for LeafQueue#activateApplications(). Contributed 
(devaraj: rev b039e69bb03accef485361af301fa59f03d08d6a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt


 Improve logs for LeafQueue#activateApplications() 
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Fix For: 2.8.0

 Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch


 Duplicate logging from resource manager
 during am limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588127#comment-14588127
 ] 

Hudson commented on YARN-3711:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #219 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/219/])
YARN-3711. Documentation of ResourceManager HA should explain configurations 
about listen addresses. Contributed by Masatake Iwasaki. (ozawa: rev 
e8c514373f2d258663497a33ffb3b231d0743b57)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerHA.md
* hadoop-yarn-project/CHANGES.txt


 Documentation of ResourceManager HA should explain configurations about 
 listen addresses
 

 Key: YARN-3711
 URL: https://issues.apache.org/jira/browse/YARN-3711
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Fix For: 2.7.1

 Attachments: YARN-3711.002.patch, YARN-3711.003.patch


 There should be explanation about webapp address in addition to RPC address.
 AM proxy filter needs explicit definition of 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses 
 in RM-HA mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Joep Rottinghuis (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joep Rottinghuis updated YARN-3706:
---
Attachment: YARN-3706-YARN-2928.014.patch

YARN-3706-YARN-2928.014.patch fixes the one FindBugs warning.

The error about FindBugs being broken on main reads like this in the console:
{noformat}
  Running findbugs in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
/home/jenkins/tools/maven/latest/bin/mvn clean test findbugs:findbugs 
-DskipTests -DhadoopPatchProcess  
/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/YARN-2928FindBugsOutputhadoop-yarn-server-timelineservice.txt
 21
Exception in thread main java.io.FileNotFoundException: 
/home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/YARN-2928FindbugsWarningshadoop-yarn-server-timelineservice.xml
 (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at 
edu.umd.cs.findbugs.SortedBugCollection.progessMonitoredInputStream(SortedBugCollection.java:1231)
at 
edu.umd.cs.findbugs.SortedBugCollection.readXML(SortedBugCollection.java:308)
at 
edu.umd.cs.findbugs.SortedBugCollection.readXML(SortedBugCollection.java:295)
at edu.umd.cs.findbugs.workflow.Filter.main(Filter.java:712)
Pre-patch YARN-2928 findbugs is broken?
{noformat}

That appears unrelated to this patch.

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588224#comment-14588224
 ] 

Hudson commented on YARN-3711:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2176 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2176/])
YARN-3711. Documentation of ResourceManager HA should explain configurations 
about listen addresses. Contributed by Masatake Iwasaki. (ozawa: rev 
e8c514373f2d258663497a33ffb3b231d0743b57)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerHA.md
* hadoop-yarn-project/CHANGES.txt


 Documentation of ResourceManager HA should explain configurations about 
 listen addresses
 

 Key: YARN-3711
 URL: https://issues.apache.org/jira/browse/YARN-3711
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Fix For: 2.7.1

 Attachments: YARN-3711.002.patch, YARN-3711.003.patch


 There should be explanation about webapp address in addition to RPC address.
 AM proxy filter needs explicit definition of 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses 
 in RM-HA mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588285#comment-14588285
 ] 

Bibin A Chundatt commented on YARN-3810:


Analysis

{{KerberosAuthenticationHandler#serverSubject}} is initialized with 
*HTTP/IP@HADOOP.COM* because {{HttpServer2#hostname}} is not resolved to a 
hostname in {{HttpServer2#build()}}:
{code}
  if (hostName == null) {
    hostName = endpoints.get(0).getHost();
  }
{code}

Since this value is still an IP address when {{HttpServer2#initSpnego}} runs, 
kerberos.principal ends up being set to *HTTP/IP@HADOOP.COM*.
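
A minimal sketch (not the committed fix) of the kind of resolution that would avoid an
IP-based SPNEGO principal: map the configured bind host back to its canonical hostname.
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class BindHostResolver {
  public static String toHostName(String hostOrIp) {
    try {
      // Returns the canonical hostname when reverse DNS is available.
      return InetAddress.getByName(hostOrIp).getCanonicalHostName();
    } catch (UnknownHostException e) {
      return hostOrIp; // fall back to the configured value
    }
  }
}
{code}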




 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical

 Steps to reproduce
 ===
 1. Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. In the RM web address, also configure the IP
 3. Start up the RM
 Call the REST API for the RM: {{curl -i -k --insecure --negotiate -u : https://IP/ws/v1/cluster/info}}
 *Actual*
 The REST API fails
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588334#comment-14588334
 ] 

Hadoop QA commented on YARN-3809:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  17m 31s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 37s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 42s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 21s | The applied patch generated  1 
new checkstyle issues (total was 213, now 213). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 54s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   0m 22s | Tests passed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |  50m 49s | Tests failed in 
hadoop-yarn-server-resourcemanager. |
| | |  93m  4s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | 
hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12739875/YARN-3809.01.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8261/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8261/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8261/console |


This message was automatically generated.

 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3809.01.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to deal with 
 AMLauncherEventType (LAUNCH and CLEANUP).
 In our cluster, there were NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs 
 running on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. 
 ApplicationMasterLauncher's thread pool filled up, and all its threads hung 
 in containerMgrProxy.stopContainers(stopRequest) because the NM was 
 down and the default RPC timeout is 15 minutes. This means that for 15 minutes 
 ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
 attempts failed to launch because of the timeout.
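 
 One generic mitigation (a sketch only; the property name below is hypothetical and
 not necessarily what the attached patch does) is to make the launcher pool size
 configurable instead of hard-coding 10:
 {code}
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 
 import org.apache.hadoop.conf.Configuration;
 
 public class LauncherPoolSketch {
   // Hypothetical key used only for illustration.
   static final String THREAD_COUNT_KEY = "yarn.resourcemanager.amlauncher.thread-count";
 
   public static ExecutorService createLauncherPool(Configuration conf) {
     int poolSize = conf.getInt(THREAD_COUNT_KEY, 50);
     return Executors.newFixedThreadPool(poolSize);
   }
 }
 {code}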



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-3810:
--

 Summary: Rest API failing when ip configured in RM address in 
secure https mode
 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical


Steps to reproduce
===
1. Configure hadoop.http.authentication.kerberos.principal as below

{code:xml}
  <property>
    <name>hadoop.http.authentication.kerberos.principal</name>
    <value>HTTP/_h...@hadoop.com</value>
  </property>
{code}

2. In the RM web address, also configure the IP
3. Start up the RM

Call the REST API for the RM: {{curl -i -k --insecure --negotiate -u : https://IP/ws/v1/cluster/info}}

*Actual*

The REST API fails

{code}
2015-06-16 19:03:49,845 DEBUG 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
Authentication exception: GSSException: No valid credentials provided 
(Mechanism level: Failed to find any Kerberos credentails)
org.apache.hadoop.security.authentication.client.AuthenticationException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos credentails)
at 
org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
at 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
at 
org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
{code}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3807:
---
Description: 
This proposal talks about limitations of the YARN scheduling policies for SLA 
applications, and tries to solve them by YARN-3806 and the new scheduling 
policy called guaranteed capacity scheduling.
Guaranteed capacity scheduling makes a guarantee to applications that they 
can get resources, under a specified capacity cap, in a totally predictable manner. 
The application can meet its SLA more easily since it is self-contained in the 
shared cluster - external uncertainties are eliminated.
For example, suppose queue A has an initial capacity of 100G memory, and there are 
two pending applications 1 and 2; 1’s specified capacity is 70G, 2’s specified 
capacity is 50G. Queue A may accept application 1 to run first and guarantee 
that 1 can get resources exponentially up to its capacity and won’t 
be preempted (if the allocation of 1 is 5G in scheduling cycle N, demand is 80G, and the 
exponential factor is 2, then in N+1 it can get 5G, in N+2 10G, in N+3 20G, 
and in N+4 30G, reaching its capacity). Later, when 
the cluster is free, queue A may decide to scale up by increasing its capacity 
to 120G, so it can accept application 2 and make the same guarantee to it as well. Queue 
A can scale down to its initial capacity when any application completes.
Guaranteed capacity scheduling also has other features that the example doesn’t 
illustrate. See proposal for more details.

  was:
This proposal talks about limitations of the YARN scheduling policies for SLA 
applications, and tries to solve them by YARN-3806 and the new scheduling 
policy called guaranteed capacity scheduling.
Guaranteed capacity scheduling makes guarantee to the applications that they 
can get resources under specified capacity cap in totally predictable manner. 
The application can meet SLA more easily since it is self-contained in the 
shared cluster - external uncertainties are eliminated.
For example, suppose queue A has initial capacity 100G memory, and there are 
two pending applications 1 and 2, 1’s specified capacity is 70G, 2’s specified 
capacity is 50G. Queue A may accept application 1 to run first and makes 
guarantee that 1 can get resources exponentially up to its capacity and won’t 
be preempted (if allocation of 1 is 5G in scheduling cycle N, demand is 80G, 
exponential factor is 2. In N+1, it can get 5G, in N+2, it can get 10G, in N+3, 
it can get 20G, and in N+4, it can get 30G, reach its capacity). Later, when 
the cluster is free, queue A may decide to scale up by increasing its capacity 
to 120G, so it can accept application 2 and make guarantee to it as well. Queue 
A can scale down to its initial capacity when any application completes.
Guaranteed capacity scheduling also have some other features. See proposal for 
more details.
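
For reference, the exponential ramp-up in the example above can be reproduced with a
small calculation (a sketch under the stated assumptions: 5G already allocated in cycle
N, factor 2, 70G capacity; it prints the 5G/10G/20G/30G grants):
{code}
public class RampUpExample {
  public static void main(String[] args) {
    long capacity = 70;   // application's specified capacity, in GB
    long allocated = 5;   // allocation already held in cycle N
    long grant = 5;       // the next grant starts at the initial allocation
    for (int cycle = 1; allocated < capacity; cycle++) {
      long thisGrant = Math.min(grant, capacity - allocated); // never exceed the cap
      allocated += thisGrant;
      System.out.println("cycle N+" + cycle + ": +" + thisGrant + "G, total " + allocated + "G");
      grant *= 2;         // exponential factor 2
    }
  }
}
{code}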


 Proposal of Guaranteed Capacity Scheduling for YARN
 ---

 Key: YARN-3807
 URL: https://issues.apache.org/jira/browse/YARN-3807
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, fairscheduler
Reporter: Wei Shao
 Attachments: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf


 This proposal talks about limitations of the YARN scheduling policies for SLA 
 applications, and tries to solve them by YARN-3806 and the new scheduling 
 policy called guaranteed capacity scheduling.
 Guaranteed capacity scheduling makes a guarantee to applications that they 
 can get resources, under a specified capacity cap, in a totally predictable manner. 
 The application can meet its SLA more easily since it is self-contained in the 
 shared cluster - external uncertainties are eliminated.
 For example, suppose queue A has an initial capacity of 100G memory, and there are 
 two pending applications 1 and 2; 1’s specified capacity is 70G, 2’s 
 specified capacity is 50G. Queue A may accept application 1 to run first and 
 guarantee that 1 can get resources exponentially up to its capacity and 
 won’t be preempted (if the allocation of 1 is 5G in scheduling cycle N, demand is 
 80G, and the exponential factor is 2, then in N+1 it can get 5G, in N+2 10G, 
 in N+3 20G, and in N+4 30G, reaching its capacity). 
 Later, when the cluster is free, queue A may decide to scale up by increasing 
 its capacity to 120G, so it can accept application 2 and make the same guarantee to it 
 as well. Queue A can scale down to its initial capacity when any application 
 completes.
 Guaranteed capacity scheduling also has other features that the example 
 doesn’t illustrate. See proposal for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3711) Documentation of ResourceManager HA should explain configurations about listen addresses

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588184#comment-14588184
 ] 

Hudson commented on YARN-3711:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/228/])
YARN-3711. Documentation of ResourceManager HA should explain configurations 
about listen addresses. Contributed by Masatake Iwasaki. (ozawa: rev 
e8c514373f2d258663497a33ffb3b231d0743b57)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/ResourceManagerHA.md
* hadoop-yarn-project/CHANGES.txt


 Documentation of ResourceManager HA should explain configurations about 
 listen addresses
 

 Key: YARN-3711
 URL: https://issues.apache.org/jira/browse/YARN-3711
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: documentation
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Fix For: 2.7.1

 Attachments: YARN-3711.002.patch, YARN-3711.003.patch


 There should be explanation about webapp address in addition to RPC address.
 AM proxy filter needs explicit definition of 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} to get proper addresses 
 in RM-HA mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Improve logs for LeafQueue#activateApplications()

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588181#comment-14588181
 ] 

Hudson commented on YARN-3789:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #228 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/228/])
YARN-3789. Improve logs for LeafQueue#activateApplications(). Contributed 
(devaraj: rev b039e69bb03accef485361af301fa59f03d08d6a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt


 Improve logs for LeafQueue#activateApplications() 
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Fix For: 2.8.0

 Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch


 Duplicate logging from resource manager
 during am limit check for each application
 {code}
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}
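 As a rough illustration of the kind of cleanup such a change could make (this is a 
 sketch, not the actual YARN-3789 patch; the variable names are assumptions), the 
 message could carry the application id and the offending values so repeated lines 
 become distinguishable:
 {code}
 // Sketch only: log identifying context instead of a fixed, repeated message.
 if (LOG.isInfoEnabled()) {
   LOG.info("Not activating " + application.getApplicationId()
       + " because amIfStarted=" + amIfStarted + " exceeds amLimit=" + amLimit);
 }
 {code}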



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3789) Improve logs for LeafQueue#activateApplications()

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588221#comment-14588221
 ] 

Hudson commented on YARN-3789:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2176 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2176/])
YARN-3789. Improve logs for LeafQueue#activateApplications(). Contributed 
(devaraj: rev b039e69bb03accef485361af301fa59f03d08d6a)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* hadoop-yarn-project/CHANGES.txt


 Improve logs for LeafQueue#activateApplications() 
 --

 Key: YARN-3789
 URL: https://issues.apache.org/jira/browse/YARN-3789
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Minor
 Fix For: 2.8.0

 Attachments: 0001-YARN-3789.patch, 0002-YARN-3789.patch, 
 0003-YARN-3789.patch, 0004-YARN-3789.patch, 0005-YARN-3789.patch


 Duplicate log messages from the ResourceManager
 during the AM limit check for each application
 {code}
 015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 2015-06-09 17:32:40,019 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 not starting application as amIfStarted exceeds amLimit
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588254#comment-14588254
 ] 

MENG DING commented on YARN-1197:
-

Thanks guys for all the comments! I think we all agreed that the container decrease 
request should go through the RM, and that the decrease action will be triggered with 
the RM-NM heartbeat.

For the increase request and action, option (a) theoretically has better performance, 
but it adds extra complexity for both YARN and application writers. I was wondering if 
we could consider option (c), which sort of meets (a) and (b) in the middle:

1) AM sends the increase request to the RM.
2) RM allocates the resource and sends the increase token to the NM.
3) RM sends the response to the AM right away, instead of waiting for the NM to confirm 
that the increase action has completed.
4) Upon receiving the response (which indicates that the increase has been triggered), 
the AM should first poll the container status to make sure the increase is done before 
allocating new tasks.

Option (c) saves one NM-RM heartbeat cycle, and since both options (a) and (c) need to 
poll the container status, their performance will be very close.

We can have option (b) enabled by default, and use a configuration parameter to 
turn on option (c) for frameworks like Spark.

Do you think this is worth considering?
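To make the flow concrete, here is a rough AM-side sketch of option (c); every method 
name below is hypothetical, since the increase API is still being designed:
{code}
// Hypothetical AM-side flow for option (c); none of these calls are final APIs.
amRmClient.requestIncrease(containerId, targetResource);      // 1) AM -> RM
AllocateResponse rsp = amRmClient.allocate(progress);         // 3) RM replies without waiting for the NM
if (rsp.getApprovedIncreases().contains(containerId)) {
  // 4) poll the NM-reported container resource until the increase has taken effect
  while (!nmClient.queryContainerResource(containerId).equals(targetResource)) {
    Thread.sleep(pollIntervalMs);                             // back off between polls
  }
  launchAdditionalTasks(containerId);                         // only now use the enlarged container
}
{code}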

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3807:
---
Attachment: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.1.pdf

 Proposal of Guaranteed Capacity Scheduling for YARN
 ---

 Key: YARN-3807
 URL: https://issues.apache.org/jira/browse/YARN-3807
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, fairscheduler
Reporter: Wei Shao
 Attachments: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf, 
 ProposalOfGuaranteedCapacitySchedulingForYARN-V1.1.pdf


 This proposal talks about limitations of the YARN scheduling policies for SLA 
 applications, and tries to solve them by YARN-3806 and the new scheduling 
 policy called guaranteed capacity scheduling.
 Guaranteed capacity scheduling guarantees to applications that they can get 
 resources up to a specified capacity cap in a totally predictable manner. 
 The application can meet its SLA more easily since it is self-contained in the 
 shared cluster: external uncertainties are eliminated.
 For example, suppose queue A has an initial capacity of 100G memory, and there are 
 two pending applications 1 and 2; 1's specified capacity is 70G, 2's 
 specified capacity is 50G. Queue A may accept application 1 to run first and 
 guarantee that 1 can get resources exponentially up to its capacity and 
 won't be preempted (if the allocation of 1 is 5G in scheduling cycle N, the demand is 
 80G, and the exponential factor is 2, then in N+1 it can get 5G, in N+2 it can get 10G, 
 in N+3 it can get 20G, and in N+4 it can get 30G, reaching its capacity). 
 Later, when the cluster is free, queue A may decide to scale up by increasing 
 its capacity to 120G, so it can accept application 2 and make the same guarantee to it 
 as well. Queue A can scale down to its initial capacity when any application 
 completes.
 Guaranteed capacity scheduling also has other features that the example 
 doesn’t illustrate. See proposal for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3808) Proposal of Time Based Fair Scheduling for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3808:
---
Attachment: (was: 
ProposalOfGuaranteedCapacitySchedulingForYARN-V1.1.pdf)

 Proposal of Time Based Fair Scheduling for YARN
 ---

 Key: YARN-3808
 URL: https://issues.apache.org/jira/browse/YARN-3808
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, scheduler
Reporter: Wei Shao
 Attachments: ProposalOfTimeBasedFairSchedulingForYARN-V1.0.pdf


 This proposal talks about the issues of YARN fair scheduling policy, and 
 tries to solve them by YARN-3806 and the new scheduling policy called time 
 based fair scheduling.
 The time based fair scheduling policy is proposed to enforce time-based fairness 
 among users. For example, if two users share the cluster weekly, each user's 
 fair share is half of the cluster per week. In a particular week, if the 
 first user has used the whole cluster for the first half of the week, then in the 
 second half of the week the second user will always have priority to use cluster 
 resources, since the first user has already used up its fair share of the cluster.
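 As a toy illustration of the accounting this implies (the numbers, names and the 
 weekly window are assumptions, not from the proposal):
 {code}
 // Memory-seconds used by each user so far this week, against a weekly fair share.
 long user1Usage = 500_000L;   // used most of the cluster in the first half of the week
 long user2Usage =  20_000L;
 long fairShare  = 260_000L;   // half of the weekly cluster capacity, say
 // Whoever is furthest below its time-based fair share gets priority next.
 String next = (fairShare - user1Usage) >= (fairShare - user2Usage) ? "user1" : "user2";
 System.out.println("priority: " + next);   // user2, since user1 exhausted its share
 {code}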



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3808) Proposal of Time Based Fair Scheduling for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3808:
---
Attachment: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.1.pdf

 Proposal of Time Based Fair Scheduling for YARN
 ---

 Key: YARN-3808
 URL: https://issues.apache.org/jira/browse/YARN-3808
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, scheduler
Reporter: Wei Shao
 Attachments: ProposalOfTimeBasedFairSchedulingForYARN-V1.0.pdf


 This proposal talks about the issues of YARN fair scheduling policy, and 
 tries to solve them by YARN-3806 and the new scheduling policy called time 
 based fair scheduling.
 The time based fair scheduling policy is proposed to enforce time-based fairness 
 among users. For example, if two users share the cluster weekly, each user's 
 fair share is half of the cluster per week. In a particular week, if the 
 first user has used the whole cluster for the first half of the week, then in the 
 second half of the week the second user will always have priority to use cluster 
 resources, since the first user has already used up its fair share of the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf)

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao

 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since an application can get a guaranteed resource share, and fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, the current YARN scheduling framework doesn't have 
 a mechanism for multiple scheduling policies to work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf


 Currently, A typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-16 Thread Jun Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Gong updated YARN-3809:
---
Attachment: YARN-3809.01.patch

Attached a patch. It makes the thread pool size configurable, with a default of 50.

 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3809.01.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to deal with 
 AMLauncherEventType (LAUNCH and CLEANUP).
 In our cluster, there were many NMs with 10+ AMs running on them, and one shut 
 down for some reason. After the RM marked the NM as LOST, it cleaned up the AMs running 
 on it, so ApplicationMasterLauncher had to handle these 10+ CLEANUP events. 
 ApplicationMasterLauncher's thread pool filled up, and all of its threads hung 
 in containerMgrProxy.stopContainers(stopRequest) because the NM was 
 down and the default RPC timeout is 15 minutes. This means that for 15 minutes 
 ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
 attempts failed to launch because of the timeout.
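 The attached patch makes the pool size configurable (default 50); a minimal sketch of 
 what that looks like, with the config key name being an assumption rather than the 
 final one:
 {code}
 // Sketch: read the launcher pool size from configuration instead of hard-coding 10.
 int poolSize = conf.getInt("yarn.resourcemanager.amlauncher.thread-count", 50);
 ThreadPoolExecutor launcherPool = new ThreadPoolExecutor(
     poolSize, poolSize, 1, TimeUnit.HOURS, new LinkedBlockingQueue<Runnable>());
 {code}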



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3807) Proposal of Guaranteed Capacity Scheduling for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3807:
---
Description: 
This proposal talks about limitations of the YARN scheduling policies for SLA 
applications, and tries to solve them by YARN-3806 and the new scheduling 
policy called guaranteed capacity scheduling.
Guaranteed capacity scheduling makes guarantee to the applications that they 
can get resources under specified capacity cap in totally predictable manner. 
The application can meet SLA more easily since it is self-contained in the 
shared cluster - external uncertainties are eliminated.
For example, suppose queue A has initial capacity 100G memory, and there are 
two pending applications 1 and 2, 1’s specified capacity is 70G, 2’s specified 
capacity is 50G. Queue A may accept application 1 to run first and makes 
guarantee that 1 can get resources exponentially up to its capacity and won’t 
be preempted (if allocation of 1 is 5G in scheduling cycle N, demand is 80G, 
exponential factor is 2. In N+1, it can get 5G, in N+2, it can get 10G, in N+3, 
it can get 20G, and in N+4, it can get 30G, reach its capacity). Later, when 
the cluster is free, queue A may decide to scale up by increasing its capacity 
to 120G, so it can accept application 2 and make guarantee to it as well. Queue 
A can scale down to its initial capacity when any application completes.
Guaranteed capacity scheduling also have some other features. See proposal for 
more details.

  was:
This proposal talks about limitations of the YARN scheduling policies for SLA 
applications, and tries to solve them by YARN-3806 and the new scheduling 
policy called guaranteed capacity scheduling.
Guaranteed capacity scheduling makes guarantee to the applications that they 
can get resources under specified capacity cap in totally predictable manner. 
The application can meet SLA more easily since it is self-contained in the 
shared cluster - external uncertainties are eliminated.


 Proposal of Guaranteed Capacity Scheduling for YARN
 ---

 Key: YARN-3807
 URL: https://issues.apache.org/jira/browse/YARN-3807
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, fairscheduler
Reporter: Wei Shao
 Attachments: ProposalOfGuaranteedCapacitySchedulingForYARN-V1.0.pdf


 This proposal talks about limitations of the YARN scheduling policies for SLA 
 applications, and tries to solve them by YARN-3806 and the new scheduling 
 policy called guaranteed capacity scheduling.
 Guaranteed capacity scheduling makes guarantee to the applications that they 
 can get resources under specified capacity cap in totally predictable manner. 
 The application can meet SLA more easily since it is self-contained in the 
 shared cluster - external uncertainties are eliminated.
 For example, suppose queue A has initial capacity 100G memory, and there are 
 two pending applications 1 and 2, 1’s specified capacity is 70G, 2’s 
 specified capacity is 50G. Queue A may accept application 1 to run first and 
 makes guarantee that 1 can get resources exponentially up to its capacity and 
 won’t be preempted (if allocation of 1 is 5G in scheduling cycle N, demand is 
 80G, exponential factor is 2. In N+1, it can get 5G, in N+2, it can get 10G, 
 in N+3, it can get 20G, and in N+4, it can get 30G, reach its capacity). 
 Later, when the cluster is free, queue A may decide to scale up by increasing 
 its capacity to 120G, so it can accept application 2 and make guarantee to it 
 as well. Queue A can scale down to its initial capacity when any application 
 completes.
 Guaranteed capacity scheduling also have some other features. See proposal 
 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588315#comment-14588315
 ] 

Hadoop QA commented on YARN-3706:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m 25s | Findbugs (version ) appears to 
be broken on YARN-2928. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 3 new or modified test files. |
| {color:green}+1{color} | javac |   7m 44s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 46s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 24s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 14s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m 10s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 38s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 44s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   0m 45s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 15s | Tests passed in 
hadoop-yarn-server-timelineservice. |
| | |  38m 11s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12739880/YARN-3706-YARN-2928.014.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | YARN-2928 / a1bb913 |
| hadoop-yarn-server-timelineservice test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8262/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8262/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8262/console |


This message was automatically generated.

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3810:
---
Attachment: 0001-YARN-3810.patch

Hoping that the analysis is correct. Please review the uploaded patch.

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3810.patch


 Steps to reproduce
 ===
 1.Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. In the RM web address, also configure the IP 
 3. Start up the RM 
 Call the REST API for the RM: {{curl -i -k --insecure --negotiate -u : 
 https://<IP>/ws/v1/cluster/info}}
 *Actual*
 The REST API fails
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}
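 For context, the SPNEGO filter substitutes _HOST in the configured principal with the 
 host part of the web address; a rough sketch of why a raw IP there leaves no matching 
 keytab entry (the address, realm and steps below are hypothetical, not the patch):
 {code}
 // Hypothetical illustration: _HOST substitution with an IP-form webapp address.
 String configured = "10.1.2.3:8090";                       // RM webapp address (IP form)
 String host = configured.split(":")[0];
 String principal = "HTTP/_HOST@EXAMPLE.COM".replace("_HOST", host);
 // -> HTTP/10.1.2.3@EXAMPLE.COM, which is not in the keytab, so GSS finds no credentials
 {code}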



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2

2015-06-16 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588465#comment-14588465
 ] 

Naganarasimha G R commented on YARN-3792:
-

bq. I'm not sure why that is the case. Maybe they didn't want to run the base 
tests? Unless that changes, I guess we'll have to check for null. This looks 
pretty brittle however. Sigh.
Well, we can make {{TestDistributedShell.setupInternal}} take an additional 
argument for the method name so that the caller can pass it in; would that be less 
brittle?
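Something along these lines (the signature is an assumption, not the final patch):
{code}
  // Sketch: let the caller name the test explicitly instead of relying on a TestName rule.
  protected void setupInternal(int numNodeManagers, String callerTestName) throws Exception {
    LOG.info("Starting DistributedShell mini cluster for " + callerTestName);
    // existing MiniYARNCluster setup continues unchanged
  }
{code}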

 Test case failures in TestDistributedShell and some issue fixes related to 
 ATSV2
 

 Key: YARN-3792
 URL: https://issues.apache.org/jira/browse/YARN-3792
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
 Attachments: YARN-3792-YARN-2928.001.patch


 # Encountered [testcase 
 failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] 
 which were happening even without the patch modifications in YARN-3044:
 TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
 TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
 TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
 # Remove the unused {{enableATSV1}} in TestDistributedShell
 # Container metrics need to be published only for the v2 test cases of 
 TestDistributedShell
 # A NullPointerException was thrown in TimelineClientImpl.constructResURI when the aux 
 service was not configured and {{TimelineClient.putObjects}} was getting 
 invoked.
 # Race condition between the application events being published and the test case 
 verification of the RM's ApplicationFinished timeline events
 # Application tags are converted to lowercase in 
 ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to 
 detect the custom flow details of the app



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3804:
---
Attachment: YARN-3804.02.patch

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs 
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, and the ACL 
 permission is not available for the user.
 This leads to an infinite retry of the same switch to Active, with false always 
 returned from {{ActiveStandbyElector#becomeActive()}}.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even at the first attempt, 
 since at runtime the user from which it retries refreshAdminAcls can never be 
 updated.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*
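 As an immediate workaround implied by the analysis above (the configuration values are 
 illustrative), the RM principal's user can be added to the admin ACL so that 
 refreshAdminAcls succeeds on failover:
 {code:xml}
 <property>
   <name>yarn.admin.acl</name>
   <value>dsperf,yarn</value>
 </property>
 {code}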



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3810:
---
Attachment: 0002-YARN-3810.patch

Uploading a patch with the formatting fixed. Sorry, I missed it last time.

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3810.patch, 0002-YARN-3810.patch


 Steps to reproduce
 ===
 1.Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   property
 namehadoop.http.authentication.kerberos.principal/name
 valueHTTP/_h...@hadoop.com/value
   /property
 {code}
 2. In RM web address also configure IP 
 3. Startup RM 
 Call Rest API for RM  {{ curl -i -k  --insecure --negotiate -u : https IP 
 /ws/v1/cluster/info}}
 *Actual*
 Rest API  failing
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588458#comment-14588458
 ] 

Hadoop QA commented on YARN-3804:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  15m  5s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 35s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 29s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 33s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| | |  38m 54s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12739897/YARN-3804.02.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8263/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8263/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8263/console |


This message was automatically generated.

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On  RM Configure yarn.admin.acl=dsperf
 3. Configure in arn.resourcemanager.principal=yarn
 4. Start Both RM 
 Both RM will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 

[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588386#comment-14588386
 ] 

Xuan Gong commented on YARN-3804:
-

[~varun_saxena]
ConfiguredYarnAuthorizer#setAdmins is called from AdminService:
{code}
  @Override
  public void setAdmins(AccessControlList acls, UserGroupInformation ugi) {
    adminAcl = acls;
  }
{code}
Could we add the logic here?
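For instance, a rough sketch of the kind of logic being suggested (not the committed 
fix; whether this is the right place is exactly the open question):
{code}
  @Override
  public void setAdmins(AccessControlList acls, UserGroupInformation ugi) {
    // Sketch only: always keep the RM daemon user in the ACL so transitionToActive
    // can call refreshAdminAcls on itself even if yarn.admin.acl omits it.
    AccessControlList withDaemon = new AccessControlList(acls.getAclString());
    withDaemon.addUser(ugi.getShortUserName());
    adminAcl = withDaemon;
  }
{code}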

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On  RM Configure yarn.admin.acl=dsperf
 3. Configure in arn.resourcemanager.principal=yarn
 4. Start Both RM 
 Both RM will be in Standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each RM attempt to switch to Active refreshACl is called and acl 
 permission not available for the user
 Infinite retry for the same switch to Active and always false returned from 
 {{ActiveStandbyElector#becomeActive()}}
  
 *Expected*
 RM should get shutdown event after few retry or even at first attempt
 Since at runtime user from which it retries for refreshacl can never be 
 updated.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf)

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao

 Currently, A typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, A typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588478#comment-14588478
 ] 

Sandy Ryza commented on YARN-1197:
--

bq. I think this assumes cluster is quite idle, I understand the low latency 
could be achieved, but it's not guaranteed since we don't support 
oversubscribing, etc.
If the cluster is fully contended we certainly won't get this performance.  But 
as long as there is a decent chunk of space, which is common in many settings, 
we can.  The cluster doesn't need to be fully idle by any means.

More broadly, just because YARN is not good at hitting sub-second latencies 
doesn't mean that it isn't a design goal.  I strongly oppose any argument that 
uses the current slowness of YARN as a justification for why we should make 
architectural decisions that could compromise latencies.

That said, I still don't have a strong grasp on the kind of complexity we're 
introducing in the AM, so would like to try to understand that before arguing 
against you further.

Is the main problem we're grappling still the one Meng brought up here:
https://issues.apache.org/jira/browse/YARN-1197?focusedCommentId=14556803page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14556803?
I.e. that an AM can receive an increase from the RM, then issue a decrease to 
the NM, and then use its increase to get resources it doesn't deserve?

Or is the idea that, even if we didn't have this JIRA, NMClient is too 
complicated, and we'd like to reduce that?

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588487#comment-14588487
 ] 

Wangda Tan commented on YARN-1197:
--

Thanks [~mding],

I think (c) sounds like a very good proposal; it has several advantages:
- Latency is better than (a) (if we assume the network conditions between 
AM-RM and RM-NM are the same), since the RM sends the response to the NM at the same heartbeat.
- It doesn't expose the container token, etc. to the AM when the increase is approved, which is 
not necessary; the AM only needs to poll the NM about the status of the resource change.
- It can be considered an additional step on top of (b) ((c) = (b) + 
rm_response_to_am_when_increase_approved + am_poll_nm_about_increase_status), 
which is good for planning as well.

bq. We can have option (b) enabled by default, and use a configuration parameter to 
turn on option (c) for frameworks like Spark.
I think the two can be enabled together; I don't see any conflict between them, since the 
AM can poll the NM if it doesn't want to wait for another NM-RM heartbeat.

Thoughts? [~sandyr], [~vinodkv].

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588507#comment-14588507
 ] 

Hadoop QA commented on YARN-3810:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 31s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 35s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 41s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m  6s | The applied patch generated  1 
new checkstyle issues (total was 63, now 64). |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 34s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 51s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  22m 39s | Tests failed in 
hadoop-common. |
| | |  61m 55s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.fs.shell.TestCount |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12739903/0001-YARN-3810.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/8264/artifact/patchprocess/diffcheckstylehadoop-common.txt
 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8264/artifact/patchprocess/testrun_hadoop-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8264/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8264/console |


This message was automatically generated.

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3810.patch, 0002-YARN-3810.patch


 Steps to reproduce
 ===
 1.Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   property
 namehadoop.http.authentication.kerberos.principal/name
 valueHTTP/_h...@hadoop.com/value
   /property
 {code}
 2. In RM web address also configure IP 
 3. Startup RM 
 Call Rest API for RM  {{ curl -i -k  --insecure --negotiate -u : https IP 
 /ws/v1/cluster/info}}
 *Actual*
 Rest API  failing
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588399#comment-14588399
 ] 

Sangjin Lee commented on YARN-3706:
---

The findbugs issue seems to be showing up on all our JIRAs. I'll file a JIRA 
against this issue.

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588532#comment-14588532
 ] 

Hadoop QA commented on YARN-3810:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  16m 34s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 46s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m  5s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 51s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | common tests |  22m 14s | Tests failed in 
hadoop-common. |
| | |  61m 38s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.fs.shell.TestCount |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12739913/0002-YARN-3810.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / b039e69 |
| hadoop-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/8265/artifact/patchprocess/testrun_hadoop-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/8265/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8265/console |


This message was automatically generated.

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3810.patch, 0002-YARN-3810.patch


 Steps to reproduce
 ===
 1. Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. Also configure the IP in the RM web address
 3. Start up the RM
 Call the RM REST API: {{curl -i -k --insecure --negotiate -u : https://<IP>/ws/v1/cluster/info}}
 *Actual*
 The REST API fails
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long-running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since an application gets a guaranteed resource share, while 
 fair scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, the current YARN scheduling framework doesn't 
 have a mechanism for multiple scheduling policies to work hierarchically in 
 one cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588480#comment-14588480
 ] 

Xuan Gong commented on YARN-3804:
-

[~varun_saxena]
Actually, in AdminService#serviceInit, we have
{code}
authorizer.setAdmins(new AccessControlList(conf.get(
    YarnConfiguration.YARN_ADMIN_ACL,
    YarnConfiguration.DEFAULT_YARN_ADMIN_ACL)),
    UserGroupInformation.getCurrentUser());
{code}
We could create a common function that adds the daemon user to the 
AccessControlList and then pass the modified AccessControlList to this method. 
That way we would not need to change the code for every 
YarnAuthorizationProvider (such as ConfiguredYarnAuthorizer).

We also need a similar change in AdminService#refreshAdminAcls().
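For illustration only (the helper name below is hypothetical, not from the 
attached patch), the shared piece could look roughly like this inside 
AdminService:
{code}
// Hypothetical sketch: build the admin ACL from configuration and always add
// the user the RM daemon runs as, so ACL refreshes cannot lock the daemon out.
private AccessControlList getAdminAclsWithDaemonUser(Configuration conf)
    throws IOException {
  AccessControlList acl = new AccessControlList(conf.get(
      YarnConfiguration.YARN_ADMIN_ACL,
      YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
  acl.addUser(UserGroupInformation.getCurrentUser().getShortUserName());
  return acl;
}

// Both serviceInit() and refreshAdminAcls() would then call:
// authorizer.setAdmins(getAdminAclsWithDaemonUser(conf),
//     UserGroupInformation.getCurrentUser());
{code}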


 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, but the 
 user does not have the required ACL permission.
 The RM retries the switch to Active indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user as which it retries refreshAdminAcls can never change 
 at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2

2015-06-16 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588901#comment-14588901
 ] 

Sangjin Lee commented on YARN-3792:
---

Yes that might be an idea.

 Test case failures in TestDistributedShell and some issue fixes related to 
 ATSV2
 

 Key: YARN-3792
 URL: https://issues.apache.org/jira/browse/YARN-3792
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
 Attachments: YARN-3792-YARN-2928.001.patch


 # Encountered [testcase 
 failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] 
 which were happening even without the patch modifications in YARN-3044:
 TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
 TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
 TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
 # Remove unused {{enableATSV1}} in TestDistributedShell
 # Container metrics need to be published only for the v2 test cases of 
 TestDistributedShell
 # A NullPointerException was thrown in TimelineClientImpl.constructResURI when 
 the aux service was not configured and {{TimelineClient.putObjects}} was 
 invoked
 # Race condition between the application events being published and the test 
 case verification of the RM's ApplicationFinished timeline events
 # Application tags are converted to lowercase in 
 ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to 
 detect the custom flow details of the app



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588767#comment-14588767
 ] 

Sangjin Lee commented on YARN-3706:
---

The latest patch (14) looks good to me. I'll wait on the result of the 
standalone testing before I commit this. I'm not saying the standalone testing 
is a required part of the review/commit, but rather I'd like to see if there is 
anything we're not covering with our unit tests. If we should find anything, we 
could use it as an opportunity to increase coverage.

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588827#comment-14588827
 ] 

Wangda Tan commented on YARN-3806:
--

Hi [~wshao],
Thanks for providing your thoughts on this. I took a quick look at the attached 
design doc; some comments below (please correct me if I missed anything).

The JIRA wants to tackle the following issues:
# Pluggable preemption policy
# Be able to add other allocation policies
# Application level configuration 
# Decouple application / nodes from scheduler

#1/#2 should already be covered by YARN-3306. Its design doc doesn't include a 
detailed ParentQueue policy, but we plan to extend it to ParentQueue as 
mentioned in YARN-3306. I also found that preemptResource and acquireResource 
in your design are very close to what we have in CS. If you have time, could 
you take a look at YARN-3318 (which is already committed)? Ordering policy is 
part of the queue policy; is that what you were trying to do?

#3, I'm not sure this is a valid use case. I can understand an admin setting a 
maximum limit for an application, but setting a minimum share per app does not 
sound like fair allocation.

#4, we now have a common abstraction of application 
(SchedulerApplicationAttempt) and node (SchedulerNode) for the different 
scheduler implementations. Are you suggesting eliminating scheduler-specific 
implementations such as FiCaSchedulerApp/FiCaSchedulerNode? I think that might 
be problematic; you can think of the app as a pluggable implementation as well. 
For instance, FS and CS have different logic at the app level, such as how to 
do delayed scheduling, limits, etc.

On the details in the design doc:
- Each LeafQueue can run at most one app, which does not seem like a queue. 
This is very restrictive; for example, if there is a requirement to run 1k apps 
at the same time, do you need to configure 1k LeafQueues? And how would you 
choose where to submit an application?
- YARN-2986 is trying to create a unified view of scheduler configuration. Is 
there any overlap between YARN-2986 and the configuration model mentioned in 
your doc?

Thoughts?

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long-running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since an application gets a guaranteed resource share, while 
 fair scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, the current YARN scheduling framework doesn't 
 have a mechanism for multiple scheduling policies to work hierarchically in 
 one cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3804:
---
Attachment: YARN-3804.04.patch

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, 
 YARN-3804.03.patch, YARN-3804.04.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, but the 
 user does not have the required ACL permission.
 The RM retries the switch to Active indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user as which it retries refreshAdminAcls can never change 
 at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1459#comment-1459
 ] 

Wangda Tan commented on YARN-1197:
--

Thanks for the comments, [~sseth]/[~sandyr].

Now I'm convinced, from the two downstream developers' point of view: +1 to 
doing AM-RM-AM-NM (a) for increase, as in the original doc, before (b). I'm not 
sure (b) is really required; we can do (b) if there are any real use cases.

bq. More broadly, just because YARN is not good at hitting sub-second latencies 
doesn't mean that it isn't a design goal. I strongly oppose any argument that 
uses the current slowness of YARN as a justification for why we should make 
architectural decisions that could compromise latencies.
Makes sense to me.

bq. I.e. that an AM can receive an increase from the RM, then issue a decrease 
to the NM, and then use its increase to get resources it doesn't deserve?
Yes, if we send the increase request to the RM but send the decrease request to 
the NM, we need to handle complex inconsistencies on the RM side. You can take 
a look at the latest design doc for more details.

bq. I don't think it's possible for the AM to start using the additional 
allocation till the NM has updated all it's state - including writing out 
recovery information for work preserving restart (Thanks Vinod for pointing 
this out). Seems like that poll/callback will be required - unless the plan is 
to route this information via the RM.
Maybe we need to wait for all increase steps (monitor/cgroup/state-store) to 
finish before using the additional allocation. For example, if a 5G container 
is increased to 10G, the RM/NM crashes before the write to the state store, and 
the app starts using 10G, then after RM restart/recovery the NM/RM will think 
the container is 5G, which will be problematic.

[~mding], do you agree with doing (a)?

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3119) Memory limit check need not be enforced unless aggregate usage of all containers is near limit

2015-06-16 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588429#comment-14588429
 ] 

Chris Douglas commented on YARN-3119:
-

Systems that embrace more forgiving resource enforcement are difficult to tune, 
particularly if those jobs run in multiple environments with different 
constraints (as is common when moving from research/test to production). If 
jobs silently and implicitly use more resources than requested, then users only 
learn that their container is under-provisioned when the cluster workload 
shifts, and their pipelines start to fail.

I agree with [~aw]'s 
[feedback|https://issues.apache.org/jira/browse/YARN-3119?focusedCommentId=14303956page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14303956].
 If this workaround is committed, this should be disabled by default and 
strongly discouraged.

 Memory limit check need not be enforced unless aggregate usage of all 
 containers is near limit
 --

 Key: YARN-3119
 URL: https://issues.apache.org/jira/browse/YARN-3119
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Anubhav Dhoot
Assignee: Anubhav Dhoot
 Attachments: YARN-3119.prelim.patch


 Today we kill containers preemptively even if the total usage of all 
 containers on that node is well within the limit for YARN. Instead, if we 
 enforce the per-container memory limit only when the total usage of all 
 containers is close to some configurable ratio of the overall memory assigned 
 to containers, we can allow flexibility in container memory usage without 
 adverse effects. This is similar in principle to how cgroups uses 
 soft_limit_in_bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3792) Test case failures in TestDistributedShell and some issue fixes related to ATSV2

2015-06-16 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588441#comment-14588441
 ] 

Sangjin Lee commented on YARN-3792:
---

{quote}
Other approach i can think of is to make the test sleep for 500 seconds 3~4 
times or till the desired result is got (i think this is the approach which is 
followed in most of the test cases which are highly async), thoughts ?
{quote}

Given the asynchronous nature (and lack of strong coordination between 
actions), I'm +1 on looping a few times.
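
A minimal sketch of what the looping could look like (the helper name and 
timings below are illustrative only, not from the patch):
{code}
// Illustrative only: poll a few times for the asynchronously published entity
// instead of relying on a single fixed sleep.
boolean published = false;
for (int i = 0; i < 5 && !published; i++) {
  published = isEntityPublished();  // hypothetical check used by the test
  if (!published) {
    Thread.sleep(500);
  }
}
Assert.assertTrue("Timeline entity was not published in time", published);
{code}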

{quote}
Well atleast in my environment(latest ubuntu) even after JAVA_HOME was set 
properly in shell, test cases when executed from eclipse were failing because 
of the JAVA_HOME was not availble and it took a while for me to figure out 
doing this. Hence i thought it would be usefull information for others who r 
testing for the first time, If you guys feel not necessary i can remove it.
{quote}

IMO the comment seems a little out of place. I would prefer not having this 
comment here, but I'd like to hear what others think.

{quote}
yes its req, as mentioned earlier for 
TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression was 
failing because method name rule will not set be set in 
TestDistributedShell.setupInternal
{quote}

Thanks for reminding me. I didn't realize that 
{{TestDistributedShellWithNodeLabels}} does *NOT* extend 
{{TestDistributedShell}}. I'm not sure why that is the case. Maybe they didn't 
want to run the base tests? Unless that changes, I guess we'll have to check 
for null. This looks pretty brittle however. Sigh.

 Test case failures in TestDistributedShell and some issue fixes related to 
 ATSV2
 

 Key: YARN-3792
 URL: https://issues.apache.org/jira/browse/YARN-3792
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
 Attachments: YARN-3792-YARN-2928.001.patch


 # Encountered [testcase 
 failures|https://builds.apache.org/job/PreCommit-YARN-Build/8233/testReport/] 
 which were happening even without the patch modifications in YARN-3044:
 TestDistributedShell.testDSShellWithoutDomainV2CustomizedFlow
 TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow
 TestDistributedShellWithNodeLabels.testDSShellWithNodeLabelExpression
 # Remove unused {{enableATSV1}} in TestDistributedShell
 # Container metrics need to be published only for the v2 test cases of 
 TestDistributedShell
 # A NullPointerException was thrown in TimelineClientImpl.constructResURI when 
 the aux service was not configured and {{TimelineClient.putObjects}} was 
 invoked
 # Race condition between the application events being published and the test 
 case verification of the RM's ApplicationFinished timeline events
 # Application tags are converted to lowercase in 
 ApplicationSubmissionContextPBImpl, hence RMTimelineCollector was not able to 
 detect the custom flow details of the app



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id

2015-06-16 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588776#comment-14588776
 ] 

Xuan Gong commented on YARN-3714:
-

Committed into trunk/branch-2. Thanks, [~iwasakims]

 AM proxy filter can not get RM webapp address from 
 yarn.resourcemanager.hostname.rm-id
 --

 Key: YARN-3714
 URL: https://issues.apache.org/jira/browse/YARN-3714
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, 
 YARN-3714.003.patch, YARN-3714.004.patch


 The default proxy address could not be obtained without setting 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is 
 enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588847#comment-14588847
 ] 

Siddharth Seth commented on YARN-1197:
--

bq. I would argue that waiting for an NM-RM heartbeat is much worse than 
waiting for an AM-RM heartbeat. With continuous scheduling, the RM can make 
decisions in millisecond time, and the AM can regulate its heartbeats according 
to the application's needs to get fast responses. If an NM-RM heartbeat is 
involved, the application is at the mercy of the cluster settings, which should 
be in the multi-second range for large clusters.
I tend to agree with Sandy's arguments about option a being better in terms of 
latency - and that we shouldn't be architecting this in a manner which would 
limit it to the seconds range rather than milliseconds / hundreds of 
milliseconds when possible.

It's already possible to get fast allocations - in the low 100s of 
milliseconds - via a scheduler loop that is delinked from NM heartbeats and a 
variable AM-RM heartbeat interval, which is under user control rather than 
being a cluster property.
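
As a concrete illustration of the user-controlled part (the 100 ms value and 
the callback handler are placeholders, not a recommendation):
{code}
// Sketch only: the AM picks its own AM-RM heartbeat interval when it creates
// its async client; this is a per-application choice, not a cluster setting.
AMRMClientAsync<AMRMClient.ContainerRequest> amRmClient =
    AMRMClientAsync.createAMRMClientAsync(100, callbackHandler);
amRmClient.init(conf);
amRmClient.start();
{code}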

There are going to be improvements to the performance of various protocols in 
YARN. HADOOP-11552 opens up one such option, which allows AMs to know about 
allocations as soon as the scheduler has made the decision, without a 
requirement to poll. Of course, there's plenty of work to be done before that 
can actually be used :)

That said, callbacks on the RPC can be applied at various levels - including 
NM-RM communication - which can make option b work fast as well. However, it 
will incur the cost of additional RPC round trips. Option a, on the other hand, 
can be fast from the get-go with tuning, and also gets better with future 
enhancements.

I don't think it's possible for the AM to start using the additional allocation 
till the NM has updated all its state - including writing out recovery 
information for work-preserving restart (thanks Vinod for pointing this out). 
It seems like that poll/callback will be required - unless the plan is to route 
this information via the RM.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long-running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since an application gets a guaranteed resource share, while 
 fair scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, the current YARN scheduling framework doesn't 
 have a mechanism for multiple scheduling policies to work hierarchically in 
 one cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588660#comment-14588660
 ] 

Varun Saxena commented on YARN-3804:


OK, got it... because we reload the configuration.

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, but the 
 user does not have the required ACL permission.
 The RM retries the switch to Active indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user as which it retries refreshAdminAcls can never change 
 at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588679#comment-14588679
 ] 

MENG DING commented on YARN-1197:
-

[~leftnoteasy], if I understand it correctly, in the {{AllocateResponseProto}} 
we will have something like {{containers_change_approved}} and 
{{containers_change_completed}}. The former will be filled with the 
ID/capability of containers whose change requests have been approved by the RM. 
The latter will be filled with the ID/capability of containers whose resource 
changes have been completed on the NM. Right?
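
Purely as an illustration of that reading (none of these method names are 
final or exist in the current API):
{code}
// Hypothetical view from the AM side of the proposed AllocateResponse fields.
AllocateResponse response = amRmClient.allocate(progress);
// Filled by the RM: container changes it has approved (new capability/token).
List<Container> approvedChanges = response.getContainersChangeApproved();
// Filled after the NM reports back: changes that have actually been applied.
List<Container> completedChanges = response.getContainersChangeCompleted();
{code}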

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes that the resource allocated 
 to a container is fixed during its lifetime. When users want to change the 
 resource of an allocated container, the only way is to release it and allocate 
 a new container with the expected size.
 Allowing run-time changes to the resources of an allocated container will give 
 us better control of resource usage on the application side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588724#comment-14588724
 ] 

Varun Saxena commented on YARN-3804:


Anyhow, even that wouldn't have been a proper fix, because setAdmins is called 
again on refresh.

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, 
 YARN-3804.03.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, but the 
 user does not have the required ACL permission.
 The RM retries the switch to Active indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user as which it retries refreshAdminAcls can never change 
 at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3804:
---
Attachment: YARN-3804.03.patch

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, 
 YARN-3804.03.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, but the 
 user does not have the required ACL permission.
 The RM retries the switch to Active indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user as which it retries refreshAdminAcls can never change 
 at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3810) Rest API failing when ip configured in RM address in secure https mode

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588733#comment-14588733
 ] 

Varun Saxena commented on YARN-3810:


If the fix is in {{HttpServer2}}, the issue can be moved to Hadoop Common.

 Rest API failing when ip configured in RM address in secure https mode
 --

 Key: YARN-3810
 URL: https://issues.apache.org/jira/browse/YARN-3810
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
 Attachments: 0001-YARN-3810.patch, 0002-YARN-3810.patch


 Steps to reproduce
 ===
 1. Configure hadoop.http.authentication.kerberos.principal as below
 {code:xml}
   <property>
     <name>hadoop.http.authentication.kerberos.principal</name>
     <value>HTTP/_h...@hadoop.com</value>
   </property>
 {code}
 2. Also configure the IP in the RM web address
 3. Start up the RM
 Call the RM REST API: {{curl -i -k --insecure --negotiate -u : https://<IP>/ws/v1/cluster/info}}
 *Actual*
 The REST API fails
 {code}
 2015-06-16 19:03:49,845 DEBUG 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
 Authentication exception: GSSException: No valid credentials provided 
 (Mechanism level: Failed to find any Kerberos credentails)
 org.apache.hadoop.security.authentication.client.AuthenticationException: 
 GSSException: No valid credentials provided (Mechanism level: Failed to find 
 any Kerberos credentails)
   at 
 org.apache.hadoop.security.authentication.server.KerberosAuthenticationHandler.authenticate(KerberosAuthenticationHandler.java:399)
   at 
 org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:348)
   at 
 org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:519)
   at 
 org.apache.hadoop.yarn.server.security.http.RMAuthenticationFilter.doFilter(RMAuthenticationFilter.java:82)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588648#comment-14588648
 ] 

Varun Saxena commented on YARN-3804:


Oops, sorry for the mistake.
I had tested this in my local setup and removed setAdmins from AdminService, 
but forgot to include the AdminService changes in the patch.

Is there any reason explicitly calling {{setAdmins}} is required, by the way?

Anyway, your suggestion makes sense to avoid issues with another auth provider. 
Will make the change.

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch


 Steps to reproduce
 
 1. Configure cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will be in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active, the RM calls refreshAdminAcls, but the 
 user does not have the required ACL permission.
 The RM retries the switch to Active indefinitely, and 
 {{ActiveStandbyElector#becomeActive()}} always returns false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first 
 attempt, since the user as which it retries refreshAdminAcls can never change 
 at runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588671#comment-14588671
 ] 

MENG DING commented on YARN-1197:
-

[~sandyr], by processing both resource decrease and increase requests through 
the RM, the original problem that I brought up should no longer be an issue. 
What we are trying to grasp right now is whether it is really necessary for the 
increase action to go through RM-AM-NM. IMHO, if we can eliminate the need for 
that while still achieving reasonable performance, that would be ideal.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf)

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, A typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf

Some minor changes.

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, A typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf)

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf


 Currently, A typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Description: 
Currently, a typical YARN cluster runs many different kinds of applications: 
production applications, ad hoc user applications, long running services and so 
on. Different YARN scheduling policies may be suitable for different 
applications. For example, capacity scheduling can manage production 
applications well since application can get guaranteed resource share, fair 
scheduling can manage ad hoc user applications well since it can enforce 
fairness among users. However, current YARN scheduling framework doesn’t have a 
mechanism for multiple scheduling policies work hierarchically in one cluster.

YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
proposed a per-queue policy driven framework. In detail, it supported different 
scheduling policies for leaf queues. However, support of different scheduling 
policies for upper level queues is not seriously considered yet. 

A generic scheduling framework is proposed here to address these limitations. 
It supports different policies for any queue consistently. The proposal tries 
to solve many other issues in current YARN scheduling framework as well.

  was:
Currently, A typical YARN cluster runs many different kinds of applications: 
production applications, ad hoc user applications, long running services and so 
on. Different YARN scheduling policies may be suitable for different 
applications. For example, capacity scheduling can manage production 
applications well since application can get guaranteed resource share, fair 
scheduling can manage ad hoc user applications well since it can enforce 
fairness among users. However, current YARN scheduling framework doesn’t have a 
mechanism for multiple scheduling policies work hierarchically in one cluster.

YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
proposed a per-queue policy driven framework. In detail, it supported different 
scheduling policies for leaf queues. However, support of different scheduling 
policies for upper level queues is not seriously considered yet. 

A generic scheduling framework is proposed here to address these limitations. 
It supports different policies for any queue consistently. The proposal tries 
to solve many other issues in current YARN scheduling framework as well.


 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588742#comment-14588742
 ] 

Xuan Gong commented on YARN-3804:
-

[~varun_saxena] The patch looks good. But could we add some test cases for this?

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, 
 YARN-3804.03.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RMs, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will stay in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to transition to Active, the RM calls refreshAdminAcls, but the
 user does not have the required ACL permission.
 The result is an infinite retry of the transition to Active, with
 {{ActiveStandbyElector#becomeActive()}} always returning false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first
 attempt, since the user it retries refreshAdminAcls as can never be changed at
 runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588753#comment-14588753
 ] 

Varun Saxena commented on YARN-3804:


ok

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, 
 YARN-3804.03.patch


 Steps to reproduce
 
 1. Configure the cluster in secure mode
 2. On the RMs, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will stay in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to transition to Active, the RM calls refreshAdminAcls, but the
 user does not have the required ACL permission.
 The result is an infinite retry of the transition to Active, with
 {{ActiveStandbyElector#becomeActive()}} always returning false.
  
 *Expected*
 The RM should get a shutdown event after a few retries, or even on the first
 attempt, since the user it retries refreshAdminAcls as can never be changed at
 runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id

2015-06-16 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588756#comment-14588756
 ] 

Xuan Gong commented on YARN-3714:
-

+1 LGTM. Will commit

 AM proxy filter can not get RM webapp address from 
 yarn.resourcemanager.hostname.rm-id
 --

 Key: YARN-3714
 URL: https://issues.apache.org/jira/browse/YARN-3714
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, 
 YARN-3714.003.patch, YARN-3714.004.patch


 The default proxy address cannot be obtained without setting 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is 
 enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3811) NM restarts could lead to app failures

2015-06-16 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-3811:
--

 Summary: NM restarts could lead to app failures
 Key: YARN-3811
 URL: https://issues.apache.org/jira/browse/YARN-3811
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical


Consider the following scenario:
1. RM assigns a container on node N to an app A.
2. Node N is restarted
3. A tries to launch container on node N.

3 could lead to an NMNotYetReadyException depending on whether NM N has 
registered with the RM. In MR, this is considered a task attempt failure. A few 
of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3714) AM proxy filter can not get RM webapp address from yarn.resourcemanager.hostname.rm-id

2015-06-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588944#comment-14588944
 ] 

Hudson commented on YARN-3714:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8028 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8028/])
YARN-3714. AM proxy filter can not get RM webapp address from (xgong: rev 
e27d5a13b0623e3eb43ac773eccd082b9d6fa9d0)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/RMHAUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/amfilter/TestAmFilterInitializer.java


 AM proxy filter can not get RM webapp address from 
 yarn.resourcemanager.hostname.rm-id
 --

 Key: YARN-3714
 URL: https://issues.apache.org/jira/browse/YARN-3714
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Masatake Iwasaki
Assignee: Masatake Iwasaki
Priority: Minor
 Fix For: 2.8.0

 Attachments: YARN-3714.001.patch, YARN-3714.002.patch, 
 YARN-3714.003.patch, YARN-3714.004.patch


 The default proxy address cannot be obtained without setting 
 {{yarn.resourcemanager.webapp.address._rm-id_}} and/or 
 {{yarn.resourcemanager.webapp.https.address._rm-id_}} explicitly if RM-HA is 
 enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1983) Support heterogeneous container types at runtime on YARN

2015-06-16 Thread Sidharta Seethana (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588954#comment-14588954
 ] 

Sidharta Seethana commented on YARN-1983:
-

Hi [~chenchun]

[~ashahab] and I have been thinking about how best to take 'container type' 
support forward in a way that leverages existing functionality (e.g. 
security/resource isolation) in a given executor (e.g. LinuxContainerExecutor) 
and yet provides the flexibility to let users pick different container types 
(e.g. Docker vs. non-Docker). To this end, we came up with the notion of 
container runtimes that can be used within the same executor. Given this, I feel 
that we no longer need to go down the path of 'composite executors'. Please see 
the following for more information:

* YARN-3611, which describes Docker support in LinuxContainerExecutor
* Our presentation/demo at the recent Hadoop Summit on how this functionality 
can be used: https://prezi.com/2mxvb0n_q1rt/yarn-and-the-docker-ecosystem/

We have been working on patches to enable this functionality and we expect to 
submit them for review soon (within the next week).

thanks,
-Sidharta 

 Support heterogeneous container types at runtime on YARN
 

 Key: YARN-1983
 URL: https://issues.apache.org/jira/browse/YARN-1983
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junping Du
 Attachments: YARN-1983.2.patch, YARN-1983.patch


 Different container types (default, LXC, docker, VM box, etc.) have different 
 semantics on isolation of security, namespace/env, performance, etc.
 Per discussions in YARN-1964, we have some good thoughts on supporting 
 different types of containers running on YARN and specified by application at 
 runtime which largely enhance YARN's flexibility to meet heterogenous app's 
 requirement on isolation at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3811) NM restarts could lead to app failures

2015-06-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588984#comment-14588984
 ] 

Karthik Kambatla commented on YARN-3811:


We ran into this in our rolling upgrade tests. 

 NM restarts could lead to app failures
 --

 Key: YARN-3811
 URL: https://issues.apache.org/jira/browse/YARN-3811
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Consider the following scenario:
 1. RM assigns a container on node N to an app A.
 2. Node N is restarted
 3. A tries to launch container on node N.
 3 could lead to an NMNotYetReadyException depending on whether NM N has 
 registered with the RM. In MR, this is considered a task attempt failure. A 
 few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1948) Expose utility methods in Apps.java publically

2015-06-16 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588992#comment-14588992
 ] 

Vinod Kumar Vavilapalli commented on YARN-1948:
---

Tx for taking this up, [~nijel]. Quick comment - I think we should make them 
more public-API-like than they are now. For example, the names look like 
internal methods, a YARN API talking about classpath breaks abstractions, etc.
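
To make "more public-API-like" concrete, here is a purely illustrative sketch of 
what such a utility could look like; the class name, placement, and method shape 
are hypothetical and are not the attached patch:
{code}
// Purely hypothetical sketch -- not the attached patch and not an existing
// YARN class -- of a more public-API-like home for the environment helpers.
import java.util.Map;

public final class ContainerEnvironmentUtils {

  private ContainerEnvironmentUtils() {
  }

  /**
   * Append a value to an environment variable, creating it if absent.
   * Mirrors the augment-the-container-environment semantics described
   * for Apps.addToEnvironment.
   */
  public static void addToEnvironment(Map<String, String> environment,
      String variable, String value, String separator) {
    String current = environment.get(variable);
    if (current == null || current.isEmpty()) {
      environment.put(variable, value);
    } else {
      environment.put(variable, current + separator + value);
    }
  }
}
{code}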

 Expose utility methods in Apps.java publically
 --

 Key: YARN-1948
 URL: https://issues.apache.org/jira/browse/YARN-1948
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Affects Versions: 2.4.0
Reporter: Sandy Ryza
Assignee: nijel
  Labels: newbie
 Attachments: YARN-1948-1.patch


 Apps.setEnvFromInputString and Apps.addToEnvironment are methods used by 
 MapReduce, Spark, and Tez that are currently marked private.  As these are 
 useful for any YARN app that wants to allow users to augment container 
 environments, it would be helpful to make them public.
 It may make sense to put them in a new class with a better name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3811) NM restarts could lead to app failures

2015-06-16 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588995#comment-14588995
 ] 

Karthik Kambatla commented on YARN-3811:


The issue is that container-launch failures are counted against the allowed 
number of task-attempt failures (4 by default in MR). We could potentially go 
about this in different ways:
# Support retries when launching containers. Start/stop containers are 
@AtMostOnce operations. This works okay for NM-restart cases, but when an NM 
goes down it will lead to the job waiting longer before trying another node.
# On failure to launch a container, return an error code that explicitly 
annotates it as a system error rather than a user error. AMs could choose not to 
count system errors against the number of task-attempt failures.
# Without any changes in YARN, MR should treat exceptions from startContainers() 
differently from failures captured in StartContainersResponse#getFailedRequests; 
that is, NMNotYetReadyException and IOException would not be counted against the 
number of allowed failures (see the sketch below).

Option 2 seems like a cleaner approach to me. 
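
As a rough illustration of the classification in option 3 (and of what an AM 
would do with a system-error code under option 2), here is a minimal sketch; the 
helper class is hypothetical and not existing MR or YARN code:
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.exceptions.NMNotYetReadyException;

// Hypothetical helper, not existing MR/YARN code: decides whether a
// container-launch failure should be charged against the allowed number
// of task-attempt failures.
public final class LaunchFailurePolicy {

  private LaunchFailurePolicy() {
  }

  /**
   * @return true if the failure looks like a system/NM-side problem
   *         (e.g. an NM that has not re-registered after a restart) rather
   *         than a user error, so it should not count against the attempt.
   */
  public static boolean isSystemError(Throwable t) {
    // NM has not yet registered with the RM after a restart.
    if (t instanceof NMNotYetReadyException) {
      return true;
    }
    // Transport-level trouble while the NM is coming back up.
    return t instanceof IOException;
  }
}
{code}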

 NM restarts could lead to app failures
 --

 Key: YARN-3811
 URL: https://issues.apache.org/jira/browse/YARN-3811
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Consider the following scenario:
 1. RM assigns a container on node N to an app A.
 2. Node N is restarted
 3. A tries to launch container on node N.
 3 could lead to an NMNotYetReadyException depending on whether NM N has 
 registered with the RM. In MR, this is considered a task attempt failure. A 
 few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589001#comment-14589001
 ] 

MENG DING commented on YARN-1197:
-

Sorry, I got things mixed up.

Correction:

We definitely need the container increase token in {{AllocateResponseProto}}. 
For the decrease result, it is optional, but it probably doesn't hurt to set it 
anyway.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3069) Document missing properties in yarn-default.xml

2015-06-16 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589005#comment-14589005
 ] 

Akira AJISAKA commented on YARN-3069:
-

Thanks [~rchiang] for updating the patch.

bq. I'll also go through the yarn-default.xml file once more to make sure no 
default values will change.
Thanks. I checked the file and found 3 issues.

bq. yarn.log-aggregation-status.time-out.ms
The default value is 60, not 6.

bq. yarn.nodemanager.webapp.https.address
The default value is 0.0.0.0:8044

bq. yarn.intermediate-data-encryption.enable
The default value is false.

I'm +1 if these are addressed.

 Document missing properties in yarn-default.xml
 ---

 Key: YARN-3069
 URL: https://issues.apache.org/jira/browse/YARN-3069
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Ray Chiang
Assignee: Ray Chiang
  Labels: BB2015-05-TBR, supportability
 Attachments: YARN-3069.001.patch, YARN-3069.002.patch, 
 YARN-3069.003.patch, YARN-3069.004.patch, YARN-3069.005.patch, 
 YARN-3069.006.patch, YARN-3069.007.patch, YARN-3069.008.patch, 
 YARN-3069.009.patch, YARN-3069.010.patch, YARN-3069.011.patch


 The following properties are currently not defined in yarn-default.xml.  
 These properties should either be
   A) documented in yarn-default.xml OR
   B)  listed as an exception (with comments, e.g. for internal use) in the 
 TestYarnConfigurationFields unit test
 Any comments for any of the properties below are welcome.
   org.apache.hadoop.yarn.server.sharedcachemanager.RemoteAppChecker
   org.apache.hadoop.yarn.server.sharedcachemanager.store.InMemorySCMStore
   security.applicationhistory.protocol.acl
   yarn.app.container.log.backups
   yarn.app.container.log.dir
   yarn.app.container.log.filesize
   yarn.client.app-submission.poll-interval
   yarn.client.application-client-protocol.poll-timeout-ms
   yarn.is.minicluster
   yarn.log.server.url
   yarn.minicluster.control-resource-monitoring
   yarn.minicluster.fixed.ports
   yarn.minicluster.use-rpc
   yarn.node-labels.fs-store.retry-policy-spec
   yarn.node-labels.fs-store.root-dir
   yarn.node-labels.manager-class
   yarn.nodemanager.container-executor.os.sched.priority.adjustment
   yarn.nodemanager.container-monitor.process-tree.class
   yarn.nodemanager.disk-health-checker.enable
   yarn.nodemanager.docker-container-executor.image-name
   yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms
   yarn.nodemanager.linux-container-executor.group
   yarn.nodemanager.log.deletion-threads-count
   yarn.nodemanager.user-home-dir
   yarn.nodemanager.webapp.https.address
   yarn.nodemanager.webapp.spnego-keytab-file
   yarn.nodemanager.webapp.spnego-principal
   yarn.nodemanager.windows-secure-container-executor.group
   yarn.resourcemanager.configuration.file-system-based-store
   yarn.resourcemanager.delegation-token-renewer.thread-count
   yarn.resourcemanager.delegation.key.update-interval
   yarn.resourcemanager.delegation.token.max-lifetime
   yarn.resourcemanager.delegation.token.renew-interval
   yarn.resourcemanager.history-writer.multi-threaded-dispatcher.pool-size
   yarn.resourcemanager.metrics.runtime.buckets
   yarn.resourcemanager.nm-tokens.master-key-rolling-interval-secs
   yarn.resourcemanager.reservation-system.class
   yarn.resourcemanager.reservation-system.enable
   yarn.resourcemanager.reservation-system.plan.follower
   yarn.resourcemanager.reservation-system.planfollower.time-step
   yarn.resourcemanager.rm.container-allocation.expiry-interval-ms
   yarn.resourcemanager.webapp.spnego-keytab-file
   yarn.resourcemanager.webapp.spnego-principal
   yarn.scheduler.include-port-in-node-name
   yarn.timeline-service.delegation.key.update-interval
   yarn.timeline-service.delegation.token.max-lifetime
   yarn.timeline-service.delegation.token.renew-interval
   yarn.timeline-service.generic-application-history.enabled
   
 yarn.timeline-service.generic-application-history.fs-history-store.compression-type
   yarn.timeline-service.generic-application-history.fs-history-store.uri
   yarn.timeline-service.generic-application-history.store-class
   yarn.timeline-service.http-cross-origin.enabled
   yarn.tracking.url.generator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3811) NM restarts could lead to app failures

2015-06-16 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589016#comment-14589016
 ] 

Vinod Kumar Vavilapalli commented on YARN-3811:
---

This is a long standing issue - we added the exception in YARN-562.

I think that instead of the blanket retries (solution #1) above, the right 
solution is for clients to retry on NMNotYetReadyException. We could do that in 
the NMClient library for Java clients? /cc [~jianhe]
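
A minimal sketch of what such a client-side retry could look like; the wrapper 
class, retry count, and sleep interval are assumptions for illustration, not the 
actual NMClient change:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.exceptions.NMNotYetReadyException;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical wrapper, not the proposed NMClient change itself: retries the
// launch while the NM reports that it has not yet registered with the RM.
public final class RetryingContainerLauncher {

  private RetryingContainerLauncher() {
  }

  public static Map<String, ByteBuffer> startContainerWithRetry(NMClient nmClient,
      Container container, ContainerLaunchContext ctx)
      throws YarnException, IOException, InterruptedException {
    final int maxAttempts = 5;          // illustrative value only
    final long retryIntervalMs = 1000L; // illustrative value only
    for (int attempt = 1; ; attempt++) {
      try {
        return nmClient.startContainer(container, ctx);
      } catch (NMNotYetReadyException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up and let the caller handle it
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}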

 NM restarts could lead to app failures
 --

 Key: YARN-3811
 URL: https://issues.apache.org/jira/browse/YARN-3811
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Consider the following scenario:
 1. RM assigns a container on node N to an app A.
 2. Node N is restarted
 3. A tries to launch container on node N.
 3 could lead to an NMNotYetReadyException depending on whether NM N has 
 registered with the RM. In MR, this is considered a task attempt failure. A 
 few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588955#comment-14588955
 ] 

Wei Shao commented on YARN-3806:


Hi Wangda Tan,

Thanks for reading the proposal! Some quick replies.

1. The queue that applications submit to will create a leaf queue for each 
accepted application at runtime; applications won't submit to leaf queues 
directly. With the concept of a 'single application queue', the scheduling 
framework can manage queues and applications (since an application is also a 
queue) in the tree consistently. The scheduler doesn't even need to know the 
concept of an application. So the proposed preemption model (supporting 
preemption among applications in the same queue), resource allocation model, and 
configuration model (supporting application-level configuration) can be 
implemented consistently. This is one of the design differences between 
YARN-3306 and this proposal.

2. Application-level configuration. Yes, an application-specific minimalShare 
for fair scheduling isn't useful in terms of fairness. However, there are some 
other cases where it might be useful.
   a. In the configuration section for a queue with fair scheduling, there could 
be a template configuration for all single application queues it will create at 
runtime, and minimalThreshold can be specified there. Then the minimalShare for 
each application is minimalThreshold*fairShare, updated at runtime. The 
preemption model can preempt resources between applications according to 
minimalShare while respecting fairness.
   b. For a queue with capacity scheduling, or the scheduling policy proposed in 
YARN-3807, each application certainly can have its own capacity to meet its SLA.
By having a 'single application queue', all application configurations can be 
respected consistently.

Regarding your other comments, I will take a look at the JIRAs you mentioned 
and reply later. Thanks!


 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread MENG DING (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588963#comment-14588963
 ] 

MENG DING commented on YARN-1197:
-

[~leftnoteasy], I am certainly OK doing (a). My original frustration was mainly 
about inconsistency in the RM when doing a decrease through the NM; now that we 
have all agreed that decrease should go through the RM, the problem is gone.

So here is the latest proposal:

* Container resource decrease:
AM -> RM -> NM
* Container resource increase:
AM -> RM -> AM (token) -> NM. The AM needs to poll the status of the container 
before using the additional allocation.
Of course we need to properly handle token expiration (i.e., NM -> RM 
communication is needed to unregister the container from the expirer).

In addition, I do *not* see a need for any response to be set in the 
{{AllocateResponseProto}}:
* For a resource decrease, we can assume it is always successful.
* For a resource increase, we are now polling to see whether the increase was 
successful.

Let me know if this makes sense.

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-06-16 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588965#comment-14588965
 ] 

Vinod Kumar Vavilapalli commented on YARN-1197:
---

bq. I don't think it's possible for the AM to start using the additional 
allocation till the NM has updated all its state - including writing out 
recovery information for work preserving restart (Thanks Vinod for pointing 
this out). Seems like that poll/callback will be required - unless the plan is 
to route this information via the RM.
We could just use the existing getContainerStatus() API for doing this polling 
for now.
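
A rough sketch of that polling loop on the AM side, assuming the NM-reported 
{{ContainerStatus}} eventually exposes the container's current resource (the 
{{getCapability()}} accessor used here is an assumption tied to this work, not 
an existing guarantee):
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch of the AM-side polling step: wait until the NM-reported status shows
// at least the increased allocation before using it. The getCapability()
// accessor on ContainerStatus is assumed to be added as part of this work.
public final class ContainerIncreasePoller {

  private ContainerIncreasePoller() {
  }

  public static void waitForIncrease(NMClient nmClient, ContainerId containerId,
      NodeId nodeId, Resource target)
      throws YarnException, IOException, InterruptedException {
    while (true) {
      ContainerStatus status = nmClient.getContainerStatus(containerId, nodeId);
      Resource current = status.getCapability(); // assumed accessor
      if (current != null && Resources.fitsIn(target, current)) {
        return; // NM has applied the increase; safe to use the extra resources
      }
      Thread.sleep(500L); // polling interval is an arbitrary choice here
    }
  }
}
{code}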

 Support changing resources of an allocated container
 

 Key: YARN-1197
 URL: https://issues.apache.org/jira/browse/YARN-1197
 Project: Hadoop YARN
  Issue Type: Task
  Components: api, nodemanager, resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Wangda Tan
 Attachments: YARN-1197 old-design-docs-patches-for-reference.zip, 
 YARN-1197_Design.pdf


 The current YARN resource management logic assumes resource allocated to a 
 container is fixed during the lifetime of it. When users want to change a 
 resource 
 of an allocated container the only way is releasing it and allocating a new 
 container with expected size.
 Allowing run-time changing resources of an allocated container will give us 
 better control of resource usage in application side



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf

Some minor updates to proposal.

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3806) Proposal of Generic Scheduling Framework for YARN

2015-06-16 Thread Wei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Shao updated YARN-3806:
---
Attachment: (was: ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf)

 Proposal of Generic Scheduling Framework for YARN
 -

 Key: YARN-3806
 URL: https://issues.apache.org/jira/browse/YARN-3806
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Reporter: Wei Shao
 Attachments: ProposalOfGenericSchedulingFrameworkForYARN-V1.0.pdf, 
 ProposalOfGenericSchedulingFrameworkForYARN-V1.1.pdf


 Currently, a typical YARN cluster runs many different kinds of applications: 
 production applications, ad hoc user applications, long running services and 
 so on. Different YARN scheduling policies may be suitable for different 
 applications. For example, capacity scheduling can manage production 
 applications well since application can get guaranteed resource share, fair 
 scheduling can manage ad hoc user applications well since it can enforce 
 fairness among users. However, current YARN scheduling framework doesn’t have 
 a mechanism for multiple scheduling policies work hierarchically in one 
 cluster.
 YARN-3306 talked about many issues of today’s YARN scheduling framework, and 
 proposed a per-queue policy driven framework. In detail, it supported 
 different scheduling policies for leaf queues. However, support of different 
 scheduling policies for upper level queues is not seriously considered yet. 
 A generic scheduling framework is proposed here to address these limitations. 
 It supports different policies for any queue consistently. The proposal tries 
 to solve many other issues in current YARN scheduling framework as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347

2015-06-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589117#comment-14589117
 ] 

Bibin A Chundatt commented on YARN-3812:


{{applyUMask}} is failing in {{RawLocalFileSystem#mkOneDirWithMode}}. In 
{{RollingLevelDBTimelineStore}}, changing
{code}
  static final FsPermission LEVELDB_DIR_UMASK = FsPermission
      .createImmutable((short) 0700);
{code}
to
{code}
  static final FsPermission LEVELDB_DIR_UMASK = new FsPermission((short) 0700);
{code}
will work. [~rkanter], can we change it as above? Any comments?
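
For reference, a minimal standalone sketch (not the actual test) of the 
difference the stack trace points at: a plain {{FsPermission}} supports 
{{applyUMask}}, while the immutable variant throws.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.permission.FsPermission;

// Minimal reproduction sketch, not TestRollingLevelDBTimelineStore itself.
public class UmaskSketch {
  public static void main(String[] args) {
    FsPermission umask = FsPermission.getUMask(new Configuration());

    // Mutable permission: applyUMask returns a masked copy and works fine.
    FsPermission mutable = new FsPermission((short) 0700);
    System.out.println(mutable.applyUMask(umask));

    // Immutable permission: per the stack trace above, applyUMask throws
    // UnsupportedOperationException after HADOOP-11347.
    FsPermission immutable = FsPermission.createImmutable((short) 0700);
    System.out.println(immutable.applyUMask(umask));
  }
}
{code}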

 TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
 --

 Key: YARN-3812
 URL: https://issues.apache.org/jira/browse/YARN-3812
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Robert Kanter

 {{TestRollingLevelDBTimelineStore}} is failing with the below errors in 
 trunk.  I did a git bisect and found that it was due to HADOOP-11347, which 
 changed something with umasks in {{FsPermission}}.
 {noformat}
 Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 1.533 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.085 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.07 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 

[jira] [Commented] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347

2015-06-16 Thread Robert Kanter (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589121#comment-14589121
 ] 

Robert Kanter commented on YARN-3812:
-

[~bibinchundatt], good find!  That fixes the test.  Would you like to assign 
this JIRA to yourself and post a patch with the change?

 TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
 --

 Key: YARN-3812
 URL: https://issues.apache.org/jira/browse/YARN-3812
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Robert Kanter

 {{TestRollingLevelDBTimelineStore}} is failing with the below errors in 
 trunk.  I did a git bisect and found that it was due to HADOOP-11347, which 
 changed something with umasks in {{FsPermission}}.
 {noformat}
 Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 1.533 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.085 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.07 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testGetEntitiesWithPrimaryFilters(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   

[jira] [Assigned] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347

2015-06-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt reassigned YARN-3812:
--

Assignee: Bibin A Chundatt

 TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
 --

 Key: YARN-3812
 URL: https://issues.apache.org/jira/browse/YARN-3812
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Robert Kanter
Assignee: Bibin A Chundatt

 {{TestRollingLevelDBTimelineStore}} is failing with the below errors in 
 trunk.  I did a git bisect and found that it was due to HADOOP-11347, which 
 changed something with umasks in {{FsPermission}}.
 {noformat}
 Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 1.533 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.085 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.07 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testGetEntitiesWithPrimaryFilters(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.061 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 

[jira] [Updated] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347

2015-06-16 Thread Bibin A Chundatt (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-3812:
---
Attachment: 0001-YARN-3812.patch

 TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
 --

 Key: YARN-3812
 URL: https://issues.apache.org/jira/browse/YARN-3812
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Robert Kanter
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3812.patch


 {{TestRollingLevelDBTimelineStore}} is failing with the below errors in 
 trunk.  I did a git bisect and found that it was due to HADOOP-11347, which 
 changed something with umasks in {{FsPermission}}.
 {noformat}
 Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 1.533 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.085 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.07 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testGetEntitiesWithPrimaryFilters(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.061 sec   ERROR!
 

[jira] [Updated] (YARN-3645) ResourceManager can't start success if attribute value of aclSubmitApps is null in fair-scheduler.xml

2015-06-16 Thread Gabor Liptak (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Liptak updated YARN-3645:
---
Attachment: YARN-3645.3.patch

 ResourceManager can't start success if  attribute value of aclSubmitApps is 
 null in fair-scheduler.xml
 

 Key: YARN-3645
 URL: https://issues.apache.org/jira/browse/YARN-3645
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.2
Reporter: zhoulinlin
 Attachments: YARN-3645.1.patch, YARN-3645.2.patch, YARN-3645.3.patch, 
 YARN-3645.patch


 The aclSubmitApps is configured in fair-scheduler.xml like below:
 <queue name="mr">
   <aclSubmitApps></aclSubmitApps>
 </queue>
 The resourcemanager log:
 2015-05-14 12:59:48,623 INFO org.apache.hadoop.service.AbstractService: 
 Service ResourceManager failed in state INITED; cause: 
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed 
 to initialize FairScheduler
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed 
 to initialize FairScheduler
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:493)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:920)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:240)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1159)
 Caused by: java.io.IOException: Failed to initialize FairScheduler
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1301)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1318)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 7 more
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:458)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:337)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1299)
   ... 9 more
 2015-05-14 12:59:48,623 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning 
 to standby state
 2015-05-14 12:59:48,623 INFO 
 com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory: plugin 
 transitionToStandbyIn
 2015-05-14 12:59:48,623 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service ResourceManager : java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory.transitionToStandbyIn(YarnPlatformPluginProxyFactory.java:71)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:997)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1058)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1159)
 2015-05-14 12:59:48,623 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
 ResourceManager
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed 
 to initialize FairScheduler
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 

[jira] [Commented] (YARN-3645) ResourceManager can't start success if attribute value of aclSubmitApps is null in fair-scheduler.xml

2015-06-16 Thread Gabor Liptak (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589141#comment-14589141
 ] 

Gabor Liptak commented on YARN-3645:


Uploaded the same patch to validate that it still applies correctly and to see the 
checkstyle error.
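
For illustration only, a minimal sketch of the kind of null guard needed when an ACL element such as <aclSubmitApps></aclSubmitApps> has no text child (the NPE at AllocationFileLoaderService.loadQueue suggests the text node is dereferenced directly). Names and structure here are assumptions, not the content of the attached patches.

{code}
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public final class AclTextSketch {

  // An empty element has no text child, so getFirstChild() returns null and a
  // direct getNodeValue()/getData() call would throw NullPointerException;
  // fall back to a default ACL string instead.
  static String getTrimmedTextOrDefault(Element element, String defaultValue) {
    Node child = element.getFirstChild();
    if (child == null || child.getNodeValue() == null) {
      return defaultValue;
    }
    return child.getNodeValue().trim();
  }
}
{code}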

 ResourceManager can't start success if  attribute value of aclSubmitApps is 
 null in fair-scheduler.xml
 

 Key: YARN-3645
 URL: https://issues.apache.org/jira/browse/YARN-3645
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.5.2
Reporter: zhoulinlin
 Attachments: YARN-3645.1.patch, YARN-3645.2.patch, YARN-3645.3.patch, 
 YARN-3645.patch


 The aclSubmitApps is configured in fair-scheduler.xml like below:
 <queue name="mr">
   <aclSubmitApps></aclSubmitApps>
 </queue>
 The resourcemanager log:
 2015-05-14 12:59:48,623 INFO org.apache.hadoop.service.AbstractService: 
 Service ResourceManager failed in state INITED; cause: 
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed 
 to initialize FairScheduler
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed 
 to initialize FairScheduler
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:493)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:920)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:240)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1159)
 Caused by: java.io.IOException: Failed to initialize FairScheduler
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1301)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1318)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 7 more
 Caused by: java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:458)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:337)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1299)
   ... 9 more
 2015-05-14 12:59:48,623 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning 
 to standby state
 2015-05-14 12:59:48,623 INFO 
 com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory: plugin 
 transitionToStandbyIn
 2015-05-14 12:59:48,623 WARN org.apache.hadoop.service.AbstractService: When 
 stopping the service ResourceManager : java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 com.zte.zdh.platformplugin.factory.YarnPlatformPluginProxyFactory.transitionToStandbyIn(YarnPlatformPluginProxyFactory.java:71)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:997)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1058)
   at 
 org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
   at 
 org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
   at 
 org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1159)
 2015-05-14 12:59:48,623 FATAL 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting 
 ResourceManager
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Failed 
 to initialize FairScheduler
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 

[jira] [Commented] (YARN-3812) TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347

2015-06-16 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589154#comment-14589154
 ] 

Bibin A Chundatt commented on YARN-3812:


Please review the uploaded patch.
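
As an aside, a minimal sketch of one possible way to avoid the UnsupportedOperationException: pass a freshly constructed, mutable FsPermission to mkdirs() instead of a shared immutable constant, so applyUMask() is never invoked on an ImmutableFsPermission. This is an assumption about the general shape of a fix, not necessarily what 0001-YARN-3812.patch does.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public final class MkdirsSketch {

  static void createDbDir(Configuration conf, Path dbPath) throws Exception {
    FileSystem fs = FileSystem.getLocal(conf);
    // Build a fresh, mutable FsPermission rather than reusing an immutable
    // constant, so umask application inside RawLocalFileSystem#mkdirs cannot
    // throw UnsupportedOperationException.
    FsPermission dirPermission = new FsPermission((short) 0700);
    fs.mkdirs(dbPath, dirPermission);
  }
}
{code}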

 TestRollingLevelDBTimelineStore fails in trunk due to HADOOP-11347
 --

 Key: YARN-3812
 URL: https://issues.apache.org/jira/browse/YARN-3812
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 3.0.0
Reporter: Robert Kanter
Assignee: Bibin A Chundatt
 Attachments: 0001-YARN-3812.patch


 {{TestRollingLevelDBTimelineStore}} is failing with the below errors in 
 trunk.  I did a git bisect and found that it was due to HADOOP-11347, which 
 changed something with umasks in {{FsPermission}}.
 {noformat}
 Running org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 Tests run: 16, Failures: 0, Errors: 16, Skipped: 0, Time elapsed: 2.65 sec 
  FAILURE! - in 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore
 testGetDomains(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 1.533 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testRelatingToNonExistingEntity(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.085 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testValidateConfig(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 0.07 sec   ERROR!
 java.lang.UnsupportedOperationException: null
   at 
 org.apache.hadoop.fs.permission.FsPermission$ImmutableFsPermission.applyUMask(FsPermission.java:380)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:496)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:551)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:529)
   at 
 org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:314)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.initFileSystem(RollingLevelDB.java:207)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDB.init(RollingLevelDB.java:200)
   at 
 org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.serviceInit(RollingLevelDBTimelineStore.java:321)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore.setup(TestRollingLevelDBTimelineStore.java:65)
 testGetEntitiesWithPrimaryFilters(org.apache.hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore)
   Time elapsed: 

[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-16 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589164#comment-14589164
 ] 

Jun Gong commented on YARN-3809:


The checkstyle error is: YarnConfiguration.java: File length is 2,025 lines 
(max allowed is 2,000). Do we need to fix it?

The test case error is addressed in YARN-3790.

 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3809.01.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to handle 
 AMLauncherEventType events (LAUNCH and CLEANUP).
 In our cluster there were many NMs with 10+ AMs running on them, and one NM 
 shut down for some reason. After the RM marked that NM as LOST, it cleaned up 
 the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
 CLEANUP events. Its thread pool filled up, and every thread hung in 
 containerMgrProxy.stopContainers(stopRequest) because the NM was down and the 
 default RPC timeout is 15 minutes. For those 15 minutes 
 ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
 attempts failed to launch because of the timeout.
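
 A minimal sketch of the kind of change this suggests: size the launcher pool from configuration so a burst of CLEANUP events against a dead NM cannot block LAUNCH events for a full RPC timeout. The property name and default here are assumptions for illustration, not necessarily what YARN-3809.01.patch adds.

{code}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public final class LauncherPoolSketch {

  static ThreadPoolExecutor createLauncherPool(Configuration conf) {
    // Hypothetical property name; the description says the size is a
    // hard-coded 10 today.
    int poolSize = conf.getInt(
        "yarn.resourcemanager.amlauncher.thread-count", 50);
    return new ThreadPoolExecutor(
        poolSize, poolSize, 1, TimeUnit.HOURS,
        new LinkedBlockingQueue<Runnable>());
  }
}
{code}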



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589305#comment-14589305
 ] 

Joep Rottinghuis commented on YARN-3706:


Yeah, we can create an entity sub-package under storage.
Are you envisioning any other tables, column families, columns and column 
prefixes going into this package as well, or are you imagining separate 
sub-packages for separate tables?

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589287#comment-14589287
 ] 

Zhijie Shen commented on YARN-3706:
---

I didn't check the patch details, but given this refactoring work, I suggest 
moving the entity-table-related classes to 
{{org/apache/hadoop/yarn/server/timelineservice/storage/entity}}. Thoughts?

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to address the hierarchy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3706) Generalize native HBase writer for additional tables

2015-06-16 Thread Joep Rottinghuis (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589255#comment-14589255
 ] 

Joep Rottinghuis commented on YARN-3706:


Good call on the testing.

Interesting results. Loading actual history files works fine in local 
pseudo-distributed mode.
However, running the load tool errors out. I'll have to dive into what's going 
on there.

{noformat}
15/06/16 20:24:21 ERROR mapred.SimpleEntityWriter: writing to the timeline 
service failed
org.codehaus.jackson.JsonParseException: Unrecognized token 'foo_event_id': was 
expecting 'null', 'true', 'false' or NaN
 at [Source: [B@7bd9729f; line: 1, column: 14]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
at 
org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
at 
org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidToken(Utf8StreamParser.java:2274)
at 
org.codehaus.jackson.impl.Utf8StreamParser._matchToken(Utf8StreamParser.java:2232)
at 
org.codehaus.jackson.impl.Utf8StreamParser._nextTokenNotInObject(Utf8StreamParser.java:584)
at 
org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:492)
at 
org.codehaus.jackson.map.ObjectReader._initForReading(ObjectReader.java:828)
at 
org.codehaus.jackson.map.ObjectReader._bindAndClose(ObjectReader.java:752)
at 
org.codehaus.jackson.map.ObjectReader.readValue(ObjectReader.java:486)
at 
org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:93)
at 
org.apache.hadoop.yarn.server.timeline.GenericObjectMapper.read(GenericObjectMapper.java:77)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.storeEvents(HBaseTimelineWriterImpl.java:197)
at 
org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:99)
at 
org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntities(TimelineCollector.java:93)
at 
org.apache.hadoop.mapred.SimpleEntityWriter.writeEntities(SimpleEntityWriter.java:118)
at 
org.apache.hadoop.mapred.TimelineServicePerformanceV2$EntityWriter.map(TimelineServicePerformanceV2.java:220)
at 
org.apache.hadoop.mapred.TimelineServicePerformanceV2$EntityWriter.map(TimelineServicePerformanceV2.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:244)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

It appears that the value GenericObjectMapper.read is reading gets interpreted 
as a string, which probably depends on the value of the key created. I'll try 
to repro this in a unit test.
I'm not even quite sure why I'm using a GenericObjectMapper there rather than 
simply Bytes.readString, which is certainly the correct thing to do. It seems 
the test case indeed didn't contain enough values.

I think once the reader is there it should be easier to crank a bunch of 
entities through the combined write/read path and assert that they are equal.
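
For illustration, a minimal sketch of the decoding difference described above: the stored column value is plain UTF-8 text (an event id), so reading it back as a string is enough, whereas GenericObjectMapper.read() tries to parse it as a serialized generic value and fails on tokens like foo_event_id. The class and method names below are assumptions, not the writer/reader code itself.

{code}
import java.nio.charset.StandardCharsets;

public final class EventIdDecoding {

  // Decode the raw column bytes as plain UTF-8 text rather than routing them
  // through GenericObjectMapper, which expects a serialized generic object.
  static String decodeEventId(byte[] rawColumnValue) {
    return new String(rawColumnValue, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] raw = "foo_event_id".getBytes(StandardCharsets.UTF_8);
    System.out.println(decodeEventId(raw));   // prints foo_event_id
  }
}
{code}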

 Generalize native HBase writer for additional tables
 

 Key: YARN-3706
 URL: https://issues.apache.org/jira/browse/YARN-3706
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Joep Rottinghuis
Assignee: Joep Rottinghuis
Priority: Minor
 Attachments: YARN-3706-YARN-2928.001.patch, 
 YARN-3706-YARN-2928.010.patch, YARN-3706-YARN-2928.011.patch, 
 YARN-3706-YARN-2928.012.patch, YARN-3706-YARN-2928.013.patch, 
 YARN-3706-YARN-2928.014.patch, YARN-3726-YARN-2928.002.patch, 
 YARN-3726-YARN-2928.003.patch, YARN-3726-YARN-2928.004.patch, 
 YARN-3726-YARN-2928.005.patch, YARN-3726-YARN-2928.006.patch, 
 YARN-3726-YARN-2928.007.patch, YARN-3726-YARN-2928.008.patch, 
 YARN-3726-YARN-2928.009.patch


 When reviewing YARN-3411 we noticed that we could change the class hierarchy 
 a little in order to accommodate additional tables easily.
 In order to get ready for benchmark testing we left the original layout in 
 place, as performance would not be impacted by the code hierarchy.
 Here is a separate jira to 

[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.

2015-06-16 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589294#comment-14589294
 ] 

Rohith commented on YARN-2305:
--

Updated the duplicate issue link.

 When a container is in reserved state then total cluster memory is displayed 
 wrongly.
 -

 Key: YARN-2305
 URL: https://issues.apache.org/jira/browse/YARN-2305
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.1
Reporter: J.Andreina
Assignee: Sunil G
 Attachments: Capture.jpg


 ENV Details:
 ============
  3 queues: a (50%), b (25%), c (25%) --- max utilization of all queues is set to 100
  2-node cluster with 16GB total memory
 Test Steps:
 ===========
  Execute the following 3 jobs with different memory configurations for the map, reduce and AM tasks:
  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023)
  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025)
  ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62
 Issue
 =====
  When 2GB of memory is in reserved state, the total memory is shown as 15GB and used as 15GB (while the total memory is actually 16GB).
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang

2015-06-16 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589338#comment-14589338
 ] 

Devaraj K commented on YARN-3809:
-

bq. The checkstyle error is: YarnConfiguration.java: File length is 2,025 
lines (max allowed is 2,000). Do we need to fix it?

bq. The applied patch generated 1 new checkstyle issue (total was 213, now 
213).

I don't think you need to fix this checkstyle issue as part of this JIRA: the 
checkstyle count is the same after the patch, the file-length warning exists 
without the patch as well, and refactoring it would take a reasonable amount 
of effort.

 Failed to launch new attempts because ApplicationMasterLauncher's threads all 
 hang
 --

 Key: YARN-3809
 URL: https://issues.apache.org/jira/browse/YARN-3809
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3809.01.patch


 ApplicationMasterLauncher creates a thread pool of size 10 to handle 
 AMLauncherEventType events (LAUNCH and CLEANUP).
 In our cluster there were many NMs with 10+ AMs running on them, and one NM 
 shut down for some reason. After the RM marked that NM as LOST, it cleaned up 
 the AMs running on it, so ApplicationMasterLauncher had to handle these 10+ 
 CLEANUP events. Its thread pool filled up, and every thread hung in 
 containerMgrProxy.stopContainers(stopRequest) because the NM was down and the 
 default RPC timeout is 15 minutes. For those 15 minutes 
 ApplicationMasterLauncher could not handle new events such as LAUNCH, so new 
 attempts failed to launch because of the timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3798) RM shutdown with NoNode exception while updating appAttempt on zk

2015-06-16 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589253#comment-14589253
 ] 

Vinod Kumar Vavilapalli commented on YARN-3798:
---

Trying to follow the discussion so far.

Seems like we couldn't really get to the bottom of the original issue and are 
fixing related but not the same issues. If my understanding is correct, someone 
should edit the title.

Coming to the patch: By definition, CONNECTIONLOSS also means that we should 
recreate the connection?

bq.  2. [ZKRMStateStore] Failing to zkClient.close() in 
ZKRMStateStore#createConnection, but IOException is ignored.
I think this should be fixed in ZooKeeper. No amount of patching in YARN will 
fix this.
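
For illustration, a minimal sketch of the distinction being raised here: CONNECTIONLOSS is a transient condition that recreating the connection and retrying can cure, while NONODE during a multi() is a state problem that retrying the same operation cannot fix. The exception-handling shape below is an assumption, not the actual ZKRMStateStore retry code.

{code}
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

public final class ZkRetrySketch {

  static void runMulti(ZooKeeper zk, List<Op> ops) throws Exception {
    try {
      zk.multi(ops);
    } catch (KeeperException.ConnectionLossException e) {
      // Transient: recreate the ZooKeeper handle and retry the same multi()
      // (retry loop left out of this sketch).
      throw e;
    } catch (KeeperException.NoNodeException e) {
      // Structural: the znode (or its parent) is missing, so retrying the
      // identical operation will keep failing until the path is created.
      throw e;
    }
  }
}
{code}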

 RM shutdown with NoNode exception while updating appAttempt on zk
 -

 Key: YARN-3798
 URL: https://issues.apache.org/jira/browse/YARN-3798
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Blocker
 Attachments: RM.log, YARN-3798-branch-2.7.patch


 RM goes down with a NoNode exception during creation of the znode for an app attempt.
 *Please find the exception logs below:*
 {code}
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session connected
 2015-06-09 10:09:44,732 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 ZKRMStateStore Session restored
 2015-06-09 10:09:44,886 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
 Exception while executing a ZK operation.
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
   at 
 org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
   at java.lang.Thread.run(Thread.java:745)
 2015-06-09 10:09:44,887 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed 
 out ZK retries. Giving up!
 2015-06-09 10:09:44,887 ERROR 
 org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error 
 updating appAttempt: appattempt_1433764310492_7152_01
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at 

[jira] [Commented] (YARN-3804) Both RM are on standBy state when kerberos user not in yarn.admin.acl

2015-06-16 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589257#comment-14589257
 ] 

Varun Saxena commented on YARN-3804:


Test failures are related. Will fix them

 Both RM are on standBy state when kerberos user not in yarn.admin.acl
 -

 Key: YARN-3804
 URL: https://issues.apache.org/jira/browse/YARN-3804
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: Suse 11 Sp3, 2 RM, Secure
Reporter: Bibin A Chundatt
Assignee: Varun Saxena
Priority: Critical
 Attachments: YARN-3804.01.patch, YARN-3804.02.patch, 
 YARN-3804.03.patch, YARN-3804.04.patch


 Steps to reproduce
 ==================
 1. Configure the cluster in secure mode
 2. On the RM, configure yarn.admin.acl=dsperf
 3. Configure yarn.resourcemanager.principal=yarn
 4. Start both RMs
 Both RMs will stay in standby forever
 {code}
 2015-06-15 12:20:21,556 WARN 
 org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn 
 OPERATION=refreshAdminAcls  TARGET=AdminService RESULT=FAILURE  
 DESCRIPTION=Unauthorized userPERMISSIONS=
 2015-06-15 12:20:21,556 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
 Exception handling the winning of election
 org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
 at 
 org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:645)
 at 
 org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:518)
 Caused by: org.apache.hadoop.ha.ServiceFailedException: Can not execute 
 refreshAdminAcls
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:297)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
 ... 4 more
 Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
 org.apache.hadoop.security.AccessControlException: User yarn doesn't have 
 permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:230)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:465)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:295)
 ... 5 more
 Caused by: org.apache.hadoop.security.AccessControlException: User yarn 
 doesn't have permission to call 'refreshAdminAcls'
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:182)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:148)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:223)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:228)
 ... 7 more
 {code}
 *Analysis*
 On each attempt to switch to Active the RM calls refreshAdminAcls, and the ACL 
 permission is not available for the user.
 This leads to an infinite retry of the same transition to Active, with 
 {{ActiveStandbyElector#becomeActive()}} always returning false.
 *Expected*
 The RM should get a shutdown event after a few retries, or even at the first 
 attempt, since the user it retries refreshAdminAcls with can never change at 
 runtime.
 *States from commands*
  ./yarn rmadmin -getServiceState rm2
 *standby*
  ./yarn rmadmin -getServiceState rm1
 *standby*
  ./yarn rmadmin -checkHealth rm1
 *echo $? = 0*
  ./yarn rmadmin -checkHealth rm2
 *echo $? = 0*
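
 A minimal sketch of the expected behaviour described above: bound the number of failed transition-to-active attempts and give up instead of re-joining the election forever when the ACL failure can never clear. Class and method names are assumptions for illustration, not existing YARN code.

{code}
public final class BoundedActivationSketch {

  private static final int MAX_FAILED_ACTIVATIONS = 3;
  private int failedActivations;

  // Called each time transitionToActive() fails (e.g. because refreshAdminAcls
  // is rejected); returns true once the RM should stop re-joining the election
  // and shut down, since the user it retries with never changes at runtime.
  boolean recordFailureAndShouldGiveUp() {
    failedActivations++;
    return failedActivations >= MAX_FAILED_ACTIVATIONS;
  }
}
{code}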



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3811) NM restarts could lead to app failures

2015-06-16 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589256#comment-14589256
 ] 

Jian He commented on YARN-3811:
---

I'm actually wondering whether we still need NMNotYetReadyException. It is 
currently thrown when the NM has started its services but has not yet 
registered/re-registered with the RM; it may be OK to just launch the 
container (see the sketch after this list).

1. For work-preserving NM restart (the scenario in this jira), I think it's OK 
to just launch the container instead of throwing the exception.
2. For NM restart with no recovery support, startContainer will fail anyway 
because the NMToken is not valid.
3. For work-preserving RM restart, containers launched before the NM 
re-registers can be recovered on the RM when the NM sends the container 
statuses across; a startContainer call after re-registration will fail because 
the NMToken is not valid.
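
A minimal sketch of the check that reasoning leads to: the launch decision rests on NMToken validity rather than on whether the NM has (re-)registered. All names below are assumptions for illustration, not the actual ContainerManager code.

{code}
public final class StartContainerCheckSketch {

  // Per the reasoning above, the registration state is deliberately ignored:
  // an invalid NMToken already rejects the call in the cases where launching
  // would be wrong, so registration alone need not block the launch.
  static boolean shouldRejectStartContainer(boolean nmRegisteredWithRm,
                                            boolean nmTokenValid) {
    return !nmTokenValid;
  }
}
{code}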

 NM restarts could lead to app failures
 --

 Key: YARN-3811
 URL: https://issues.apache.org/jira/browse/YARN-3811
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
Priority: Critical

 Consider the following scenario:
 1. The RM assigns a container on node N to an app A.
 2. Node N is restarted.
 3. A tries to launch the container on node N.
 Step 3 could lead to an NMNotYetReadyException depending on whether NM N has 
 registered with the RM. In MR, this is considered a task attempt failure; a 
 few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

