[jira] [Created] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers
Dan Shechter created YARN-3997: -- Summary: An Application requesting multiple core containers can't preempt running application made of single core containers Key: YARN-3997 URL: https://issues.apache.org/jira/browse/YARN-3997 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.1 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines Reporter: Dan Shechter When our cluster is configured with preemption, and is fully loaded with an application consuming 1-core containers, it will not kill off these containers when a new application kicks in requesting, for example, 4-core containers. When the second application attempts to use 1-core containers as well, preemption proceeds as planned and everything works properly. It is my assumption that the fair scheduler, while recognizing it needs to kill off some container to make room for the new application, fails to find a SINGLE container satisfying the request for a 4-core container (since all existing containers are 1-core containers), and isn't smart enough to realize it needs to kill off 4 single-core containers (in this case) on a single node for the new application to be able to proceed... The exhibited effect is that the new application hangs indefinitely and never gets the resources it requires. This can easily be replicated with any YARN application. Our go-to scenario in this case is running pyspark with 1-core executors (containers) while trying to launch the h2o.ai framework, which INSISTS on having at least 4 cores per container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
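The behaviour described above suggests the preemption logic would need to aggregate several small preemptable containers on one node rather than look for a single container at least as large as the request. The following is a minimal, self-contained sketch of that aggregation using the public Resource/Resources utilities; it illustrates the idea only, is not FairScheduler code, and all names are made up for the example.
{code}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class PreemptionAggregationSketch {
  /**
   * Returns how many of the given preemptable containers on one node would have
   * to be killed before the requested capability fits, or -1 if even killing
   * all of them is not enough.
   */
  static int containersToPreempt(Resource requested, Resource freeOnNode,
      List<Resource> preemptableOnNode) {
    Resource available = Resources.clone(freeOnNode);
    if (Resources.fitsIn(requested, available)) {
      return 0; // nothing needs to be preempted
    }
    int killed = 0;
    for (Resource c : preemptableOnNode) {
      Resources.addTo(available, c);
      killed++;
      if (Resources.fitsIn(requested, available)) {
        return killed; // e.g. four 1-core containers freed for a 4-core ask
      }
    }
    return -1; // the request can never fit on this node
  }

  public static void main(String[] args) {
    Resource fourCoreAsk = Resource.newInstance(4096, 4);
    List<Resource> running = Arrays.asList(
        Resource.newInstance(1024, 1), Resource.newInstance(1024, 1),
        Resource.newInstance(1024, 1), Resource.newInstance(1024, 1));
    // Prints 4: all four 1-core containers must go before the 4-core ask fits.
    System.out.println(containersToPreempt(fourCoreAsk,
        Resource.newInstance(0, 0), running));
  }
}
{code}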
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647507#comment-14647507 ] Rohith Sharma K S commented on YARN-3250: - bq. I think one problem is that if there's ever a value set in state-store, RM cannot pick up the value using the config any more I see, I agree. Configuration files would become stale after one restart/switch. How about having a command that reads the relevant configuration from yarn-site.xml, very much like {{./yarn rmadmin refreshAdminAcls}}? That command reads *yarn.admin.acl* from the yarn-site.xml configuration when refreshAdminAcls is invoked. Along the same lines, setting cluster-max-application-priority would be {{./yarn rmadmin refreshClusterMaxPriority}} or {{./yarn rmadmin refreshClusterPriority}}. Thoughts? bq. How about yarn application ApplicationId -setPriority priority ? Makes sense. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S The current Application Priority Manager supports configuration only via file. To support runtime configuration through the admin CLI and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
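If the refresh-style command is adopted, its handler would presumably do little more than re-read the value from a freshly loaded yarn-site.xml, the same way refreshAdminAcls re-reads *yarn.admin.acl*. A rough sketch of that idea follows; the property name and method are assumptions made for illustration, not the committed API.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RefreshClusterMaxPrioritySketch {
  // Hypothetical property name, used only for illustration.
  static final String CLUSTER_MAX_APP_PRIORITY = "yarn.cluster.max-application-priority";

  /** Re-reads the cluster max priority from yarn-site.xml, as a refresh command might. */
  static int refreshClusterMaxPriority() {
    // new YarnConfiguration() reloads yarn-default.xml and yarn-site.xml from the classpath
    Configuration conf = new YarnConfiguration();
    return conf.getInt(CLUSTER_MAX_APP_PRIORITY, 0);
  }

  public static void main(String[] args) {
    System.out.println("cluster max application priority = " + refreshClusterMaxPriority());
  }
}
{code}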
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647580#comment-14647580 ] Hudson commented on YARN-3950: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #261 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/261/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647582#comment-14647582 ] Hudson commented on YARN-3919: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #261 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/261/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
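The fix being suggested is essentially a defensive null guard in the stop path, since serviceStop can run after serviceStart failed partway through initialization. Below is a sketch of the pattern; the field names are assumptions for illustration and do not necessarily match FileSystemNodeLabelsStore.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;

public class NullSafeCloseSketch {
  private FileSystem fs;                // may still be null if start() failed early
  private FSDataOutputStream editLog;   // likewise

  /** Closes only what was actually opened, so a partially started service stops cleanly. */
  public void close() throws IOException {
    IOException first = null;
    if (editLog != null) {
      try { editLog.close(); } catch (IOException e) { first = e; }
    }
    if (fs != null) {
      try { fs.close(); } catch (IOException e) { if (first == null) { first = e; } }
    }
    if (first != null) {
      throw first;
    }
  }
}
{code}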
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647578#comment-14647578 ] Hudson commented on YARN-2768: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #261 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/261/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/CHANGES.txt Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand:
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
        Resources.addTo(demand, total);
      }
    }
  }
}
{code}
The code of Resources.multiply:
{code}
public static Resource multiply(Resource lhs, double by) {
  return multiplyTo(clone(lhs), by);
}
{code}
The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
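One way to realize the suggested optimization is an in-place multiply-and-add helper, so that updateDemand never allocates a temporary Resource per ResourceRequest. The sketch below illustrates the idea; the helper name and arithmetic are assumptions, not necessarily what the committed patch does.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public final class ResourcesSketch {
  /** In-place equivalent of addTo(lhs, multiply(rhs, by)) that avoids cloning rhs. */
  public static Resource multiplyAndAddTo(Resource lhs, Resource rhs, double by) {
    lhs.setMemory(lhs.getMemory() + (int) (rhs.getMemory() * by));
    lhs.setVirtualCores(lhs.getVirtualCores() + (int) (rhs.getVirtualCores() * by));
    return lhs;
  }
  // In FSAppAttempt#updateDemand the inner loop could then become something like:
  //   Resources.multiplyAndAddTo(demand, r.getCapability(), r.getNumContainers());
  // so no temporary Resource is created for each ResourceRequest.
}
{code}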
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647585#comment-14647585 ] mujunchao commented on YARN-3857: - Thanks Devar for reviewing. 1. I think ClientToAMTokenSecretManagerInRM#hasMasterKey() is necessary; in this case the value is null, so I added a new function to recognize it. 2. Fixed. 3. As I use the @VisibleForTesting annotation, my IDE needs to import com.google.common.annotations.VisibleForTesting. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Assignee: mujunchao Priority: Critical Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair (ApplicationAttemptId, null), but we never remove it from the HashMap, as unregistration only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
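For reference, the hasMasterKey() helper mentioned above would presumably be a null-aware lookup on the secret manager's key map, letting callers distinguish 'registered with a null key (SIMPLE mode)' from 'not registered at all', and unregistering unconditionally avoids the leak. A sketch of both ideas, with the map name assumed for illustration:
{code}
import java.util.HashMap;
import java.util.Map;

import javax.crypto.SecretKey;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

public class ClientToAMTokenKeySketch {
  // Illustrative stand-in for the map kept by ClientToAMTokenSecretManagerInRM.
  private final Map<ApplicationAttemptId, SecretKey> masterKeys =
      new HashMap<ApplicationAttemptId, SecretKey>();

  /** True only when the attempt is registered AND has a real (non-null) master key. */
  public synchronized boolean hasMasterKey(ApplicationAttemptId attemptId) {
    return masterKeys.get(attemptId) != null;
  }

  /** Removing the entry even in SIMPLE mode prevents the HashMap from growing forever. */
  public synchronized void unregisterApplication(ApplicationAttemptId attemptId) {
    masterKeys.remove(attemptId);
  }
}
{code}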
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mujunchao updated YARN-3857: Attachment: YARN-3857-3.patch Fixed the review comments and formatted the code. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Assignee: mujunchao Priority: Critical Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair (ApplicationAttemptId, null), but we never remove it from the HashMap, as unregistration only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647503#comment-14647503 ] Hadoop QA commented on YARN-3990: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 12s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:red}-1{color} | javac | 7m 37s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 20s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 52m 38s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 90m 31s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747961/0002-YARN-3990.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ddc867ce | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/8714/artifact/patchprocess/diffJavacWarnings.txt | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/8714/artifact/patchprocess/patchReleaseAuditProblems.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8714/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8714/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8714/console | This message was automatically generated. AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. 
An additional check for whether the app is finished/killed/failed would minimize the unnecessary events
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
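The proposed check only needs to filter out applications already in a terminal state before queuing the RMAppNodeUpdateEvent. A minimal sketch of such a guard follows; the helper name is an assumption, and the actual patch may test the states differently.
{code}
import java.util.EnumSet;

import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;

public final class NodeUpdateFilterSketch {
  private static final EnumSet<RMAppState> TERMINAL_STATES =
      EnumSet.of(RMAppState.FINISHED, RMAppState.FAILED, RMAppState.KILLED);

  /** Skips apps that no longer care about node usable/unusable transitions. */
  static boolean shouldForwardNodeUpdate(RMApp app) {
    return !TERMINAL_STATES.contains(app.getState());
  }
  // Each loop body in NodesListManager#handle above could then start with:
  //   if (!shouldForwardNodeUpdate(app)) { continue; }
}
{code}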
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647497#comment-14647497 ] Hudson commented on YARN-2768: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #272 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/272/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647499#comment-14647499 ] Hudson commented on YARN-3950: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #272 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/272/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647501#comment-14647501 ] Hudson commented on YARN-3919: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #272 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/272/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647568#comment-14647568 ] mujunchao commented on YARN-3857: - Thanks for your review; I have fixed the comments and the indentation. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Assignee: mujunchao Priority: Critical Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair (ApplicationAttemptId, null), but we never remove it from the HashMap, as unregistration only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Jun Gong created YARN-3998: -- Summary: Add retry-times to let NM re-launch container when it fails to run Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches containers, it could specify the value. Then the NM will re-launch the container 'retry-times' times when it fails to run (e.g. the exit code is not 0). It will save a lot of time: it avoids container localization, the RM does not need to re-schedule the container, and local files in the container's working directory will be left for re-use (if the container has downloaded some big files, it does not need to re-download them when running again). We find it useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647824#comment-14647824 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2218 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2218/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/CHANGES.txt Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647826#comment-14647826 ] Hudson commented on YARN-3950: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2218 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2218/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java * hadoop-yarn-project/CHANGES.txt Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647828#comment-14647828 ] Hudson commented on YARN-3919: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2218 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2218/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647692#comment-14647692 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2199 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2199/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3965) Add startup timestamp for nodemanager
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-3965: -- Attachment: YARN-3965-3.patch Add startup timestamp for nodemanager Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs, and then finds it hard to check whether all NMs actually restarted. In practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM web service, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647743#comment-14647743 ] Jason Lowe commented on YARN-3990: -- Please look into the javac and release audit warnings. The new TestNMUpdateEvent file probably should just be named TestNodesListManager so we can add more tests specific to the NodesListManager there. Rather than using mock RM and NM objects and running some RMApps, it seems like we could have just isolated the NodesListManager more directly. We can hand it a mocked RMContext that returns a pre-baked list of apps that are alive and apps that are finished, and then verify, when we send a node usable/unusable event, that the appropriate update events are sent to the appropriate apps by registering our own event handler with something like a drain or inline dispatcher. Not a must-fix, just wondering if that approach was considered. AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647785#comment-14647785 ] Hudson commented on YARN-3950: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #269 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/269/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3965) Add startup timestamp for nodemanager
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647799#comment-14647799 ] Hadoop QA commented on YARN-3965: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 2s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 16s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 19s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 41s | The applied patch generated 2 new checkstyle issues (total was 46, now 48). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 24s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 21s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 14s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 46m 16s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747995/YARN-3965-3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ddc867ce | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8715/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8715/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8715/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8715/console | This message was automatically generated. Add startup timestamp for nodemanager Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs, and then finds it hard to check whether all NMs actually restarted. In practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM web service, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647694#comment-14647694 ] Hudson commented on YARN-3950: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2199 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2199/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647696#comment-14647696 ] Hudson commented on YARN-3919: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2199 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2199/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647787#comment-14647787 ] Hudson commented on YARN-3919: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #269 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/269/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647783#comment-14647783 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #269 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/269/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3965) Add startup timestamp for nodemanager
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647709#comment-14647709 ] Hong Zhiguo commented on YARN-3965: --- Made it private with a getter. Hi [~zxu], [~jlowe], could you please review the patch? Add startup timestamp for nodemanager Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs, and it is hard to check whether all NMs actually restarted. In practice there are always some NMs that did not restart as expected, which later leads to errors due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM web service, find out which NMs did not restart, and take manual action. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
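A minimal sketch of the idea (field and getter names are my own, not necessarily those in the YARN-3965 patch): capture the start time once, keep the field private, and expose it through a getter so a web service or info bean can report it.
{code}
class NodeManagerStartupInfo {
  // Recorded once when the object is created, i.e. when the NM process starts.
  private final long nmStartupTime = System.currentTimeMillis();

  public long getNMStartupTime() {
    return nmStartupTime;
  }
}
{code}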
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647755#comment-14647755 ] Jason Lowe commented on YARN-3998: -- Is this really a feature that YARN needs to provide? To me this is basically a case of container re-use which the application itself can control. A primitive example would be an application that launches a container that wraps the real task in a wrapper shell script or Java program that spawns the real task and will respawn it some number of times if the real task fails, before failing the entire container. I'm not sure YARN is the best place to put this functionality. Add retry-times to let NM re-launch container when it fails to run -- Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches containers, it could specify the value. Then the NM will re-launch the container 'retry-times' times when it fails to run (e.g. the exit code is not 0). This saves a lot of time: it avoids container localization, the RM does not need to re-schedule the container, and the local files in the container's working directory are left in place for re-use (if the container has downloaded some big files, it does not need to re-download them when running again). We find this useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
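A hedged sketch of the wrapper approach Jason Lowe describes (the command, script name, and retry count are illustrative assumptions, not part of any YARN API or of the proposal): the container's entry point is a small launcher that respawns the real task a fixed number of times before letting the container fail.
{code}
import java.util.Arrays;
import java.util.List;

public class RetryingLauncher {
  public static void main(String[] args) throws Exception {
    final int maxAttempts = 3;  // stand-in for the proposed 'retry-times' value
    final List<String> task = Arrays.asList("bash", "-c", "./run-real-task.sh");
    int exitCode = 1;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // Spawn the real task and wait; the container's working directory
      // (and anything already localized or downloaded) is reused across attempts.
      exitCode = new ProcessBuilder(task).inheritIO().start().waitFor();
      if (exitCode == 0) {
        break;
      }
      System.err.println("Attempt " + attempt + " failed with exit code " + exitCode);
    }
    System.exit(exitCode);  // the container only fails after all attempts fail
  }
}
{code}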
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648175#comment-14648175 ] Eric Payne commented on YARN-3978: -- {{checkstyle}} indicates that {{YarnConfiguration.java}} is too long. I will not be fixing that as part of this JIRA. Everything else from the build seems to be okay. [~jeagles], can you please have a look at this patch? Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648178#comment-14648178 ] Hadoop QA commented on YARN-3983: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 11s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 19s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 30 new checkstyle issues (total was 54, now 55). | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 16 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 21s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 49m 29s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 87m 37s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748038/YARN-3983.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 91b42e7 | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8720/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8720/console | This message was automatically generated. 
Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch, YARN-3983.2.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend the existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there are a lot of differences between increasing a container and allocating a container: - Increasing a container doesn't need to check locality delay. - Increasing a container doesn't need to build/modify a resource request tree (ANY-RACK/HOST). - Increasing a container doesn't need to check allocation/reservation starvation (see {{shouldAllocOrReserveNewContainer}}). - After increasing a container is approved by the scheduler, it needs to update an existing container token instead of creating a new container. And there are lots of similarities when allocating different types of resources: - User-limit/queue-limit will be enforced for both of them. - Both of them need resource reservation logic (maybe continuous reservation looking is needed for both of them). The purpose of this JIRA is to make it easier to extend the CapacityScheduler resource allocation logic to support different types of resource allocation, to make common code reusable, and to improve code organization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648169#comment-14648169 ] Hadoop QA commented on YARN-3906: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 31s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 15s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 7s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 25s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 50s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 29s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 38m 30s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748043/YARN-3906-YARN-2928.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8721/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8721/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8721/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8721/console | This message was automatically generated. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kishore Chaliparambil updated YARN-2884: Attachment: YARN-2884-V7.patch Fixed the javadoc warnings. Proxying all AM-RM communications - Key: YARN-2884 URL: https://issues.apache.org/jira/browse/YARN-2884 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Carlo Curino Assignee: Kishore Chaliparambil Attachments: YARN-2884-V1.patch, YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch We introduce the notion of an RMProxy, running on each node (or once per rack). Upon start the AM is forced (via tokens and configuration) to direct all its requests to a new service running on the NM that provides a proxy to the central RM. This gives us a place to: 1) perform distributed scheduling decisions, 2) throttle misbehaving AMs, and 3) mask the access to a federation of RMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648218#comment-14648218 ] Zhijie Shen commented on YARN-3049: --- [~gtCarrera9], thanks for the review. I've addressed most of your comments in the new patch except the following: bq. However, I still incline to proceed the changes in this JIRA so that we can speed up consolidating our POC patches. Exactly. bq. Reader interface: use TimelineCollectorContext to package reader arguments? Yeah, I can see the rationale behind it, but maybe it's not TimelineCollectorContext. As I see a lot of arguments for the reader interface (as well as the writer one) and the potential signature change in future (e.g, adding newApp in this patch), I start to think of grouping the primitive arguments, shielding them in some category object, such as EntityContext, EntityFilters, Opts and so on, and using these as the arguments of the interface instead. Therefore, if we want to add newApp here, we don't really need to change the method signature, but add a getter/setter in Opts. Please let me know what you think about the idea. I can file another JIRA to deal with it. bq. We're now performing filters by ourselves in memory. I'm wondering if it will be more efficient to translate some of our filter specifications into HBase filters? That sounds like a good idea, which should potentially improve the read performance. Let me investigate how to map our filters onto HBase filters and push them to the backend. Given that it may be non-trivial work, can we get this patch in and follow up on the filter change in another JIRA, just in case? bq. Add a specific test in TestHBaseTimelineWriterImpl for App2FlowTable? In fact, it has been tested. I changed the write path by letting newApp = true, and checked whether we can query the entity successfully without giving the flow/flowRun explicitly. However, I didn't do much assertion around the fields of the retrieved entities, because I'm considering deferring that work until we rewrite the whole HBase backend unit tests. The current tests are too preliminary to catch potential bugs around DB operations. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
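An illustrative-only sketch of the "group the primitive arguments" idea discussed above (the names EntityContext, ReaderOptions, and TimelineReaderSketch are hypothetical, not an existing timeline service API): adding a knob such as newApp later would mean adding a field and getter/setter here rather than changing every reader method signature.
{code}
class EntityContext {
  String clusterId;
  String userId;
  String flowId;
  Long flowRunId;
  String appId;
  String entityType;
  String entityId;
}

class ReaderOptions {
  boolean fetchEvents;
  boolean fetchMetrics;
  boolean newApp;  // example of a later addition that needs no interface change
}

interface TimelineReaderSketch {
  Object getEntity(EntityContext context, ReaderOptions options);
}
{code}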
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648336#comment-14648336 ] Hadoop QA commented on YARN-3049: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 2s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 45s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 42s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 16s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 25s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 46s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 53m 2s | Tests passed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 24s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 97m 43s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748046/YARN-3049-YARN-2928.3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8722/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8722/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8722/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8722/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8722/console | This message was automatically generated. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-3999: - Assignee: Jian He Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If external systems like ATS or ZK become very slow, draining all the events takes a lot of time. If this time grows beyond 10 minutes, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
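A hedged sketch of the bounded drain described above (not the actual YARN-3999 patch; the eventQueue field and the timeout value are assumptions): wait for the queue to empty, but give up after a deadline so a slow downstream system (ATS, ZK) cannot block shutdown indefinitely.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DrainWithTimeout {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();

  void drainEventsOnStop(long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (!eventQueue.isEmpty() && System.currentTimeMillis() < deadline) {
      Thread.sleep(100);  // poll until drained or the deadline passes
    }
    // Proceed with stop even if some events remain; better than expiring all apps.
  }
}
{code}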
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648243#comment-14648243 ] Li Lu commented on YARN-3049: - Hi [~zjshen]! Some of my comments: bq. As I see a lot of arguments for the reader interface (as well as the writer one) and the potential signature change in future (e.g, adding newApp in this patch), I start to think of grouping the primitive arguments, shielding them in some category object, such as EntityContext, EntityFilters, Opts and so on, and using these as the arguments of the interface instead. I agree. Actually I spent quite some time wondering if we really need to add the {{newApp}} argument in this patch. Encapsulating all related information into a category object appears to be a nice way to avoid future interface changes. +1. bq. Given it may be a non-trivial work, can we get this patch in and follow up the filter change in another jira just in case? Definitely. Let's consolidate the whole workflow first. Then we can start these improvements. bq. In fact, it has been tested. I change the write path by letting newApp = true, and check if we can query the entity successfully without giving the flow/flowRun explicitly. However, I didn't do much assertion around the fields of retrieved entities, because I consider of deferring this work together with rewriting the whole HBase backend unit test. Sounds good to me. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648245#comment-14648245 ] Wangda Tan commented on YARN-2923: -- Thanks for updating, [~Naganarasimha]! Some comments: 1) All script-provider-related configurations/logic should be removed; they should come with a different patch. For example, {{case YarnConfiguration.SCRIPT_NODE_LABELS_PROVIDER:}} and {{public static final String SCRIPT_NODE_LABELS_PROVIDER = script;}} should be removed. 2) For the logic in NodeStatusUpdater, the things in my mind are: - PreviousNodeLabels will be reset every time we do a fetch (to avoid handling the same node labels as much as possible). - Don't reset the node labels if the fetched node labels are incorrect (this should be part of error handling; we should treat it as an error to be avoided instead of forcing a reset). - Don't do the check if the newly fetched node labels are the same as previousNodeLabels (also to avoid handling the same node labels). A little cosmetic suggestion: I found {{startStatusUpdater}} is too complex, full of try/catch, etc. I suggest making the label-related logic: a. fetch labels and check them; b. handle the response from the RM and post-process. Each of them should be a separate method to improve readability. I suggest keeping the provider within nodemanager (instead of yarn-server-common) for this patch; we can move it if we decide to do that in the future. Please let me know about your thoughts. Wangda Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup - Key: YARN-2923 URL: https://issues.apache.org/jira/browse/YARN-2923 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, YARN-2923.20150517-1.patch As part of the Distributed Node Labels configuration, we need to support node labels being configured in yarn-site.xml. On modification of the node labels configuration in yarn-site.xml, the NM should be able to get the modified node labels from this NodeLabelsProvider service without an NM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
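An illustrative-only sketch of the readability suggestion in the comment above (names and types are placeholders, not the NodeStatusUpdater API): pull the label work out of the status-updater loop into two small methods, one that fetches and checks labels from the provider, and one that post-processes the RM's response.
{code}
import java.util.Collections;
import java.util.Set;

class LabelUpdateSketch {
  private Set<String> previousNodeLabels = Collections.emptySet();

  // a. Fetch labels and check; return null when there is nothing new to send.
  Set<String> fetchAndCheckLabels(Set<String> fetched) {
    if (fetched == null || fetched.equals(previousNodeLabels)) {
      return null;  // invalid fetch or unchanged labels: skip this update
    }
    return fetched;
  }

  // b. Handle the RM's response for the labels that were sent.
  void handleLabelUpdateResponse(Set<String> sent, boolean acceptedByRM) {
    if (acceptedByRM) {
      previousNodeLabels = sent;
    }
  }
}
{code}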
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648328#comment-14648328 ] Sangjin Lee commented on YARN-3906: --- Yes, that's a good point. I don't think the conflict will be that bad. We'll just see how these JIRAs go, and we'll adjust whichever JIRA that goes later. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3049: -- Attachment: YARN-3049-YARN-2928.3.patch [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3906: -- Attachment: YARN-3906-YARN-2928.002.patch v.2 patch posted. Fixed the whitespace. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648382#comment-14648382 ] Hadoop QA commented on YARN-3906: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 39s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 16s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 6s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 26s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 47s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 38m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748070/YARN-3906-YARN-2928.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8724/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8724/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8724/console | This message was automatically generated. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648416#comment-14648416 ] Jason Lowe commented on YARN-4000: -- Example stacktrace: {noformat} 2015-07-30 22:12:03,424 ERROR [main] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(582)) - Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:792) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1320) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:128) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1075) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1032) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:890) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2100(RMAppImpl.java:109) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:938) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:895) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:761) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:323) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:433) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1157) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:577) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1041) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1185) 2015-07-30 22:12:03,425 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:792) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1320) at
[jira] [Commented] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648457#comment-14648457 ] Zhijie Shen commented on YARN-3904: --- [~gtCarrera9], thanks for the patch. Below are my comments: bq. The two failed tests passed on my local machine, and the failures appeared to be irrelevant. This said, we may still need to fix those intermittent test failures. Do we plan to fix that in this patch? Some high-level comments: 1. As is also mentioned in YARN-3049, how about refactoring the reader/writer method signatures in a separate JIRA to avoid conflicts? 2. I suggest moving the table creation stuff into TimelineSchemaCreator. 3. As the HBase backend is accessed both directly and via Phoenix, it would be good to clean up the configuration to say we're using the HBase backend (as opposed to the FS backend) instead of specifically an HBase or Phoenix writer/reader. Other patch details: 1. Make OfflineAggregationWriter extend Service, such that you don't need to define init. 2. Now we're working towards a production-standard patch. Would you please write some javadoc to explain the schema of the aggregation tables, like what we did for the HBase tables? 3. The connection config should be moved to YarnConfiguration. 4. Why is the info column family kept? I expect the aggregation table to only have metrics data. 5. Let's also have a default PhoenixOfflineAggregationWriterImpl constructor to be used in the production code. 6. {{Class.forName(DRIVER_CLASS_NAME);}} doesn't need to be invoked every time we get a connection. Refactor timelineservice.storage to add support to online and offline aggregation writers - Key: YARN-3904 URL: https://issues.apache.org/jira/browse/YARN-3904 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch After we finished the design for time-based aggregation, we can adopt our existing Phoenix storage into the storage of the aggregated data. In this JIRA, I'm proposing to refactor the writers to add support for aggregation writers. Offline aggregation writers typically have less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
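A hedged sketch for comment 6 above (the class name, driver constant, and connection URL are placeholders, not the patch's values): load the JDBC driver class once in a static initializer instead of calling Class.forName on every connection request.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class AggregationStorageConnector {
  private static final String DRIVER_CLASS_NAME = "org.apache.phoenix.jdbc.PhoenixDriver";
  private static final String CONNECTION_URL = "jdbc:phoenix:localhost:2181";

  static {
    try {
      Class.forName(DRIVER_CLASS_NAME);   // one-time driver registration
    } catch (ClassNotFoundException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  Connection getConnection() throws SQLException {
    return DriverManager.getConnection(CONNECTION_URL);  // no Class.forName here
  }
}
{code}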
[jira] [Updated] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3906: -- Attachment: YARN-3906-YARN-2928.003.patch v.3 Forgot to add the application table to the schema creation. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648574#comment-14648574 ] Hadoop QA commented on YARN-3906: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 46s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 53s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 16s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 7s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 27s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 50s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 38m 36s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748090/YARN-3906-YARN-2928.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8727/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8727/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8727/console | This message was automatically generated. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648540#comment-14648540 ] Sangjin Lee commented on YARN-3984: --- To be clear, with the latter option, if we want to look for an event by id, we can use {{ColumnPrefixFilter}} for {{e! eventId}}, right? So in that case we won't need to fetch all columns, correct? Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
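For reference, a hedged sketch of how such a prefix filter could look with the HBase client API (the column family name "i" and the "e!" prefix/separator are illustrative, not the exact timeline schema): restrict the scan to the columns of one event id so the read does not fetch every event column.
{code}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

class EventColumnScanSketch {
  Scan scanForEvent(String eventId) {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("i"));  // assumed info column family
    // Only columns whose qualifier starts with "e!<eventId>" are returned.
    scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("e!" + eventId)));
    return scan;
  }
}
{code}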
[jira] [Commented] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648431#comment-14648431 ] Hadoop QA commented on YARN-2884: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 48s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 45s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 40s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 13s | The applied patch generated 1 new checkstyle issues (total was 237, now 237). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 23s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 6m 47s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 6m 11s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 52m 26s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 111m 1s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748065/YARN-2884-V7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8723/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8723/console | This message was automatically generated. Proxying all AM-RM communications - Key: YARN-2884 URL: https://issues.apache.org/jira/browse/YARN-2884 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Carlo Curino Assignee: Kishore Chaliparambil Attachments: YARN-2884-V1.patch, YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch We introduce the notion of an RMProxy, running on each node (or once per rack). Upon start the AM is forced (via tokens and configuration) to direct all its requests to a new services running on the NM that provide a proxy to the central RM. This give us a place to: 1) perform distributed scheduling decisions 2) throttling mis-behaving AMs 3) mask the access to a federation of RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648488#comment-14648488 ] Arun Suresh commented on YARN-2005: --- Thanks for the patch, [~adhoot]. A couple of comments: # noBlacklist in DisabledBlacklistManager can be made static final. # {{getNumClusterHosts()}} in AbstractYarnScheduler: any reason we are creating a new set? I think returning this.nodes.size() should suffice, right? # Wouldn't removing from the shared blacklist cause problems if the shared blacklist already contained the blacklisted node? Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
Jason Lowe created YARN-4000: Summary: RM crashes with NPE if leaf queue becomes parent queue during restart Key: YARN-4000 URL: https://issues.apache.org/jira/browse/YARN-4000 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe This is a similar situation to YARN-2308. If an application is active in queue A and then the RM restarts with a changed capacity scheduler configuration where queue A becomes a parent queue to other subqueues then the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648521#comment-14648521 ] Li Lu commented on YARN-3984: - Thanks! I think I'm leaning towards eventid#inverse_event_timestamp?eventKey then, if we have to do the sorting in memory anyway. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3999: -- Attachment: YARN-3999.patch Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648516#comment-14648516 ] Vrushali C commented on YARN-3984: -- If the query has the exact timestamp as well as the event id, then we can. But for queries like "Give me information about CONTAINER KILLED events for this application", we won't be able to return this information without querying for all events of this application. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648359#comment-14648359 ] Li Lu commented on YARN-3906: - Hi [~sjlee0], I looked at the patch, and have one general question. It appears that the application table reuses most of the data schema of the entity table, with just some slight changes on its row keys. I've also noticed that the newly added Application*.java files overlap significantly with Entity*.java. While the current patch is totally fine in its core functions, I'm wondering if it is possible to reuse most of the code from the entity table. Ideally, we may want to build our *Table, *ColumnFamily, etc. on each new data schema, rather than on each new table? IIUC, two most significant differences between the entity table and the application table are table names and row key structures. Maybe we can change the Entity* classes to allow those differences? Or, am I missing any key points here? split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648363#comment-14648363 ] Li Lu commented on YARN-3906: - ...fixing formatting problems... bq. Hi Sangjin Lee, I looked at the patch, and have one general question. It appears that the application table reuses most of the data schema of the entity table, with just some slight changes on its row keys. I've also noticed that the newly added Application\*.java files overlap significantly with Entity\*.java. While the current patch is totally fine in its core functions, I'm wondering if it is possible to reuse most of the code from the entity table. Ideally, we may want to build our \*Table, \*ColumnFamily, etc. on each new data schema, rather than on each new table? IIUC, two most significant differences between the entity table and the application table are table names and row key structures. Maybe we can change the Entity classes to allow those differences? Or, am I missing any key points here? split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648472#comment-14648472 ] Sangjin Lee commented on YARN-3906: --- bq. I've also noticed that the newly added Application*.java files overlap significantly with Entity*.java. Thanks for bringing up that point [~gtCarrera9]. I should have added some explanation of why I wrote it this way. That is the first thing I noticed as I looked into adding the new table. \*Table and \*RowKey are not so bad, but \*ColumnFamily, \*Column, and \*ColumnPrefix definitely have a lot of overlapping code. That is largely an artifact of the design decision to use enums to implement these classes. Enums are nice because they let us seal the list of members cleanly, and the code that uses the API becomes very strongly typed. On the other hand, the downside is that enums cannot be extended. If enums could be extended, we could have created a base class common to both the entity table and the application table, and had the entity table and the application table extend it pretty trivially. But unfortunately that doesn't work with enums, and Java doesn't have mix-ins like Scala. As a way to minimize the duplication, we introduced {{ColumnHelper}} to pull many of the common operations into that helper class. You'll notice that most of the implementations in the \*Column\* classes are simple pass-throughs to {{ColumnHelper}}. This issue is more pronounced because the entity table and the application table are so similar. For example, for the app-to-flow table (which Zhijie is working on), this might not be as big an issue. We could think of some alternatives, but I think they also have their own challenges. First, we could have only one set of classes for both the entity table and the application table, and control which one to use via some sort of argument/flag. But then the problem is that we would have lots of {{if application ... else ...}} code scattered around that single implementation; I'm not sure that is an improvement. Eventually, if this becomes more of a need, we could envision writing some sort of code generation from a table/schema description, so that given the schema description these classes can simply be code-generated. However, as you may know, code generation is not without its own problems... I hope this clarifies some of the thinking that went into this. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
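An illustrative-only sketch of the enum-plus-helper pattern described in the comment above (class and method names are simplified placeholders, not the YARN-2928 classes): the enum fixes the set of columns and stays strongly typed, while the shared behaviour lives in a plain helper that each enum constant delegates to; an EntityColumnSketch enum would look almost identical apart from its constants, which is exactly the duplication being discussed.
{code}
import java.nio.charset.StandardCharsets;

interface ColumnOps {
  byte[] getColumnQualifierBytes(String columnName);
}

final class ColumnHelperSketch implements ColumnOps {
  @Override
  public byte[] getColumnQualifierBytes(String columnName) {
    return columnName.getBytes(StandardCharsets.UTF_8);
  }
}

enum ApplicationColumnSketch {
  ID("id"),
  CREATED_TIME("created_time");

  private static final ColumnOps HELPER = new ColumnHelperSketch();
  private final String columnName;

  ApplicationColumnSketch(String columnName) {
    this.columnName = columnName;
  }

  // Pass-through to the shared helper, mirroring the *Column* -> ColumnHelper delegation.
  byte[] qualifierBytes() {
    return HELPER.getColumnQualifierBytes(columnName);
  }
}
{code}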
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648491#comment-14648491 ] Vrushali C commented on YARN-3984: -- To reach a conclusion on this: If everyone/most folks are +1 for putting the event timestamp before the event id itself {code} e! inverse_event_timestamp # eventid ? eventkey {code} I can go ahead and create the patch. Note that by doing so, we will *always* have to query for all event ids and all timestamps regardless of the query (unless we know the exact timestamp). If not, the other option is to put the event timestamp after the event id but before the event key. {code} e! eventid # inverse_event_timestamp ? eventkey {code} With this layout, we retain the option of querying for a particular event id. In both cases, we need to fetch all records, construct TimelineEvent objects and sort them into chronological order. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
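For reference, a minimal sketch of the two qualifier layouts being compared (plain string concatenation is used here instead of the real byte-level key encoding; the names are illustrative, not the actual timeline service code):
{code}
// Illustrative sketch only; not the actual timeline service key encoding.
public class EventColumnKeySketch {

  // Option 1: e!inverse_event_timestamp#eventid?eventkey
  static String timestampFirst(long eventTimestamp, String eventId, String eventKey) {
    long inverseTs = Long.MAX_VALUE - eventTimestamp; // newest events sort first
    return "e!" + inverseTs + "#" + eventId + "?" + eventKey;
  }

  // Option 2: e!eventid#inverse_event_timestamp?eventkey
  static String eventIdFirst(long eventTimestamp, String eventId, String eventKey) {
    long inverseTs = Long.MAX_VALUE - eventTimestamp;
    // keeps a common prefix per event id, so a single event id can still be scanned
    return "e!" + eventId + "#" + inverseTs + "?" + eventKey;
  }
}
{code}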
[jira] [Updated] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3999: -- Attachment: YARN-3999.patch Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648603#comment-14648603 ] Hadoop QA commented on YARN-3999: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 48s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 44s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 48s | The applied patch generated 1 new checkstyle issues (total was 50, now 50). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 3s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 3s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 43m 41s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 88m 24s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestApplicationCleanup | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | | | hadoop.yarn.server.resourcemanager.TestApplicationMasterService | | | hadoop.yarn.server.resourcemanager.TestRMAdminService | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens | | | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter | | | hadoop.yarn.server.resourcemanager.TestRMHA | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | | | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748093/YARN-3999.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8726/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8726/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8726/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8726/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency 
#63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8726/console | This message was automatically generated. Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648418#comment-14648418 ] Nathan Roberts commented on YARN-3945: -- Hi [~leftnoteasy]. Regarding minimum_user_limit_percent. - I totally agree it is very confusing. - I don't think we can change it in any significant way at this point without a major configuration switch that clearly indicates you're getting different behavior. I'm sure admins have built up clusters with this tuned in very specific ways; a significant change wouldn't be compatible with their expectations. bq. User-limit is not a fairness mechanism to balance resources between users, instead, it can lead to bad imbalance. One example is, if we set user-limit = 50, and there are 10 users running, we cannot manage how much resource can be used by each user. I don't really agree with this. It may not be doing an ideal job, but I think the intent is to introduce fairness between users. It's a progression from 0 being the most fair to 100+ being more FIFO. In your example it's trying to get everyone 50%, which isn't likely to happen, so in this case it's going to operate mostly FIFO. If the intent is to be much more fair across the 10 users, then a much smaller value would be appropriate. bq. meaningful since #active-user is changing every minute, it is not a predictable formula. Since the scheduler can't predict what an application is going to request in the future, I don't see how a predictable formula is even possible (ignoring the possibility of taking away resources via in-queue preemption). It's not great, but being fair to currently requesting users makes some sense. bq. Instead we may need to consider some notion like fair sharing: user-limit-factor becomes max-resource-limit of each user, and user-limit-percentage becomes something like guaranteed-concurrent-#user, when #user > guaranteed-concurrent-#user, rest users can only get idle shares. user-limit-factor is the max-resource-limit of each user today, right? The second one seems very hard to track. It seems like one of the initial users can stay in the guaranteed set as long as he keeps requesting resources. This doesn't seem very fair to the users only getting idle shares. maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but the description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queue's resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as an integer. {quote} Configuration related to the minimum limit should not be used in a formula to calculate the max applications for a user -- This message was sent by Atlassian JIRA (v6.3.4#6332)
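For reference, a small worked example of the formula quoted above, with hypothetical numbers, showing how the per-user application cap ends up tied to the minimum user-limit percentage:
{code}
// Hypothetical numbers, purely to illustrate the formula quoted above.
public class MaxAppsPerUserExample {
  public static void main(String[] args) {
    int maxApplications = 10000;
    float userLimit = 25f;        // minimum-user-limit-percent (the *minimum* share)
    float userLimitFactor = 1f;
    int maxApplicationsPerUser =
        (int) (maxApplications * (userLimit / 100.0f) * userLimitFactor);
    // Prints 2500, even though a single user may be allowed to use far more of
    // the queue when fewer users are active.
    System.out.println(maxApplicationsPerUser);
  }
}
{code}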
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648484#comment-14648484 ] Jian He commented on YARN-3999: --- Uploaded a patch which adds a timeout on draining the dispatcher. The value is set to be half of the am-rm-expiry-time. Beyond that, I also changed the order of a couple of services which might take a long time to flush the events on stop. Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
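A minimal sketch of the drain-with-timeout idea described above (illustrative only; this is not the actual AsyncDispatcher code, and the method and parameter names are assumptions):
{code}
// Illustrative sketch only; not the actual dispatcher implementation.
import java.util.concurrent.BlockingQueue;

public class DrainWithTimeoutSketch {
  static void awaitDrained(BlockingQueue<?> eventQueue, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!eventQueue.isEmpty() && System.currentTimeMillis() < deadline) {
      Thread.sleep(100); // the dispatcher thread keeps handling queued events
    }
    // Proceed with shutdown even if some events were not drained.
  }
}
{code}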
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648499#comment-14648499 ] Li Lu commented on YARN-3984: - Hi [~vrushalic], one quick question. I'm a little bit confused by this: bq. This would mean that we would never be able to query for a specific event. Maybe here you're assuming that the timestamp information is missing for some of our use cases? Or else, because timestamp is one of the two parts of the id of timeline event, I'm not sure why we cannot directly locate that specific column? Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3965) Add startup timestamp to nodemanager UI
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-3965: -- Attachment: YARN-3965-4.patch Add startup timestamp to nodemanager UI --- Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965-4.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs. It is then hard to check whether all NMs actually restarted; in practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM webservice, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
Carlo Curino created YARN-4003: -- Summary: ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent Key: YARN-4003 URL: https://issues.apache.org/jira/browse/YARN-4003 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Carlo Curino The inherited behavior from LeafQueue (limit AM % based on capacity) is not a good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
[ https://issues.apache.org/jira/browse/YARN-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648693#comment-14648693 ] Carlo Curino commented on YARN-4003: For LeafQueue it makes sense to compute the resources available for AMs based on capacity. For ReservationQueue, however, we rely on the fact that even if its capacity is zero, jobs can run (as resources are likely to grow substantially soon). The attached patch proposes to use the parent's guaranteed capacity as an upper bound on how many AMs we can run. This is clearly a loose constraint, but I don't know which other value would make sense. [~leftnoteasy], [~jianhe], [~subru], any thoughts? ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent Key: YARN-4003 URL: https://issues.apache.org/jira/browse/YARN-4003 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Carlo Curino Attachments: YARN-4003.patch The inherited behavior from LeafQueue (limit AM % based on capacity) is not a good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
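A rough sketch of the proposal above (the field and method names are illustrative assumptions, not the actual CapacityScheduler APIs): base the AM resource limit on the parent's guaranteed capacity rather than on this queue's own, possibly zero, capacity.
{code}
// Rough sketch of the idea only; not the actual ReservationQueue code.
public class ReservationQueueAmLimitSketch {
  float maxAMResourcePercent;       // e.g. 0.1f, the configured AM resource percentage
  long parentGuaranteedMemoryMB;    // guaranteed capacity of the parent plan queue

  long getAMResourceLimitMB() {
    // Use the parent's guaranteed capacity as the upper bound, since this
    // queue's own capacity is highly dynamic and may currently be zero.
    return (long) (parentGuaranteedMemoryMB * maxAMResourcePercent);
  }
}
{code}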
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648728#comment-14648728 ] Hadoop QA commented on YARN-3999: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 34s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 54s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 23s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 0s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 42m 51s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 83m 55s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | | | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens | | | hadoop.yarn.server.resourcemanager.TestRMHA | | | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens | | | hadoop.yarn.server.resourcemanager.TestApplicationCleanup | | | hadoop.yarn.server.resourcemanager.TestApplicationMasterService | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | | | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | | hadoop.yarn.server.resourcemanager.TestRMAdminService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748093/YARN-3999.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8730/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8730/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8730/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8730/console | This message was automatically generated. 
Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648733#comment-14648733 ] Hadoop QA commented on YARN-221: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 21m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:red}-1{color} | javac | 7m 38s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 15s | The applied patch generated 1 new checkstyle issues (total was 212, now 212). | | {color:red}-1{color} | whitespace | 1m 21s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 24s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 7m 40s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 22m 22s | Tests passed in hadoop-common. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 17s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 52m 23s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 138m 35s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748108/YARN-221-6.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/diffJavacWarnings.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8728/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8728/console | This message was automatically generated. NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-6.patch, YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3965) Add startup timestamp to nodemanager UI
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648706#comment-14648706 ] Hadoop QA commented on YARN-3965: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 6s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 46s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 1 new checkstyle issues (total was 47, now 48). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 8s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 44m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestDeletionService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748110/YARN-3965-4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8729/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8729/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8729/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8729/console | This message was automatically generated. Add startup timestamp to nodemanager UI --- Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965-4.patch, YARN-3965.patch We have startup timestamp for RM already, but don't for NM. Sometimes cluster operator modified configuration of all nodes and kicked off command to restart all NMs. He found out it's hard for him to check whether all NMs are restarted. Actually there's always some NMs didn't restart as he expected, which leads to some error later due to inconsistent configuration. If we have startup timestamp for NM, the operator could easily fetch it via NM webservice and find out which NM didn't restart, and take mannaul action for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-221: - Attachment: YARN-221-6.patch [~xgong] and others, here is the draft patch based on the new design. Besides the above discussions, * If the application specifies an invalid log aggregation policy class, the current implementation will fall back to the default policy instead of failing the application. An alternative approach is to have the NM fail the application instead. * For each new application, a new policy object will be created and used only by that application. This should be OK from a memory footprint as well as a runtime perf point of view. An alternative approach is to have applications share the same policy object if they use the same policy class and the same policy parameters. NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-6.patch, YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
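A minimal sketch of the fallback behavior described in the first bullet above (illustrative only; the interface and class names here are assumptions, not the actual NodeManager code): if the application-supplied policy class cannot be loaded or instantiated, fall back to a default policy rather than failing the application.
{code}
// Illustrative sketch only; not the actual log aggregation policy code.
interface LogAggregationPolicySketch {
  boolean shouldAggregate(String containerId);
}

class DefaultPolicySketch implements LogAggregationPolicySketch {
  public boolean shouldAggregate(String containerId) { return true; }
}

class PolicyLoaderSketch {
  static LogAggregationPolicySketch load(String policyClassName) {
    try {
      Class<? extends LogAggregationPolicySketch> clazz =
          Class.forName(policyClassName).asSubclass(LogAggregationPolicySketch.class);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (Exception e) {
      // Invalid or missing policy class: fall back to the default policy
      // instead of failing the application, as discussed above.
      return new DefaultPolicySketch();
    }
  }
}
{code}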
[jira] [Assigned] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-4000: -- Assignee: Brahma Reddy Battula RM crashes with NPE if leaf queue becomes parent queue during restart - Key: YARN-4000 URL: https://issues.apache.org/jira/browse/YARN-4000 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Brahma Reddy Battula This is a similar situation to YARN-2308. If an application is active in queue A and then the RM restarts with a changed capacity scheduler configuration where queue A becomes a parent queue to other subqueues then the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4001) normalizeHostName takes too much of execution time
Hong Zhiguo created YARN-4001: - Summary: normalizeHostName takes too much of execution time Key: YARN-4001 URL: https://issues.apache.org/jira/browse/YARN-4001 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor For each NodeHeartbeatRequest, NetUtils.normalizeHostName is called under a lock. I did profiling on a very large cluster and found that NetUtils.normalizeHostName takes most of the execution time of ResourceTrackerService.nodeHeartbeat(...). We'd better have an option to use the raw IP (plus port) as the Node identity to scale for large clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
Hong Zhiguo created YARN-4002: - Summary: make ResourceTrackerService.nodeHeartbeat more concurrent Key: YARN-4002 URL: https://issues.apache.org/jira/browse/YARN-4002 Project: Hadoop YARN Issue Type: Improvement Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Critical We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By design the method ResourceTrackerService.nodeHeartbeat should be concurrent enough to scale for large clusters. But we have a BIG lock in NodesListManager.isValidNode which I think is unnecessary. First, the fields includes and excludes of HostsFileReader are only updated on refresh of nodes. All RPC threads handling node heartbeats are only readers, so a RWLock could be used to allow concurrent access by the RPC threads. Second, since the fields includes and excludes of HostsFileReader are always updated by reference assignment, which is atomic in Java, the reader-side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
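A brief sketch of the two alternatives described above (illustrative only; not the actual NodesListManager/HostsFileReader code, and the field and method names are assumptions): heartbeat handlers are readers only, so either a read-write lock or plain volatile reference publication avoids the single big lock.
{code}
// Illustrative sketch only; names and structure are assumptions.
import java.util.Collections;
import java.util.Set;

public class HostListsSketch {
  // Alternative 2: publish new sets by reference assignment. Since reference
  // assignment is atomic, readers need no lock; volatile gives visibility.
  private volatile Set<String> includes = Collections.emptySet();
  private volatile Set<String> excludes = Collections.emptySet();

  boolean isValidNode(String host) {
    Set<String> inc = includes;   // one volatile read gives a stable snapshot
    Set<String> exc = excludes;
    return (inc.isEmpty() || inc.contains(host)) && !exc.contains(host);
  }

  // Called only when nodes are refreshed; writers are rare.
  void refresh(Set<String> newIncludes, Set<String> newExcludes) {
    includes = newIncludes;       // atomic reference assignments
    excludes = newExcludes;
  }

  // Alternative 1 would instead wrap isValidNode() in a ReadWriteLock read lock
  // and refresh() in the write lock, so heartbeat threads never block each other.
}
{code}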
[jira] [Commented] (YARN-3965) Add startup timestamp to nodemanager UI
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648666#comment-14648666 ] Hong Zhiguo commented on YARN-3965: --- Hi [~jlowe], version 4 of the patch is uploaded with 2 changes: 1) NodeInfo.getNmStartupTime -> NodeInfo.getNMStartupTime 2) removed the final qualifier on NodeManager.nmStartupTime to avoid the checkstyle error: {code} Name 'nmStartupTime' must match pattern '^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$' {code} It's private with a getter, so it's OK for it not to be final. Add startup timestamp to nodemanager UI --- Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965-4.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs. It is then hard to check whether all NMs actually restarted; in practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM webservice, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
[ https://issues.apache.org/jira/browse/YARN-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-4003: --- Attachment: YARN-4003.patch Simple proposal of a patch, where ReservationQueue overrides the LeafQueue behavior. ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent Key: YARN-4003 URL: https://issues.apache.org/jira/browse/YARN-4003 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Carlo Curino Attachments: YARN-4003.patch The inherited behavior from LeafQueue (limit AM % based on capacity) is not a good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3528: --- Attachment: YARN-3528-005.patch [~varun_saxena], thanks a lot for your review. Attached a patch to address your comments. (Earlier, all tests passed locally.) Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528-002.patch, YARN-3528-003.patch, YARN-3528-004.patch, YARN-3528-005.patch, YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648731#comment-14648731 ] Hadoop QA commented on YARN-3528: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 8m 8s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 18s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 6s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 2s | Tests passed in hadoop-common. | | {color:red}-1{color} | yarn tests | 6m 1s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 51m 55s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestNodeStatusUpdater | | | hadoop.yarn.server.nodemanager.TestNodeManagerShutdown | | | hadoop.yarn.server.nodemanager.TestNodeManagerResync | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748115/YARN-3528-005.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / c5caa25 | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8731/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8731/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8731/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8731/console | This message was automatically generated. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528-002.patch, YARN-3528-003.patch, YARN-3528-004.patch, YARN-3528-005.patch, YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. 
* All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission
[ https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648686#comment-14648686 ] Bibin A Chundatt commented on YARN-3940: Hi [~leftnoteasy], thank you for the review comments. {quote} We should check usage as I mentioned at: https://issues.apache.org/jira/browse/YARN-3940?focusedCommentId=14633876page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14633876. {quote} Will check how to handle this too. {quote} we may need to consider how to deal with node label update, currently, if we change labels on a node, all containers running on the node will be killed. I suggest to clear think about both of the problem before moving forward. {quote} As I understand it, in the below cases containers shouldn't be killed: # Running containers of applications submitted for the default partition on a labeled partition, in case of exclusivity(false) # When the queue has access to the new label / node Any other case? Can we move the second part to a separate jira for discussion? Thoughts? Please do correct me if I am wrong. Application moveToQueue should check NodeLabel permission -- Key: YARN-3940 URL: https://issues.apache.org/jira/browse/YARN-3940 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch Configure capacity scheduler Configure node label and submit application {{queue=A Label=X}} Move application to queue {{B}}, which does not have access to x {code} 2015-07-20 19:46:19,626 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1437385548409_0005_01 released container container_e08_1437385548409_0005_01_02 on node: host: host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 used=memory:512, vCores:1 with event: KILL 2015-07-20 19:46:20,970 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1437385548409_0005_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, queue=b1 doesn't have permission to access all labels in resource request. labelExpression of resource request=x.
Queue labels=y at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) {code} Same exception will be thrown till *heartbeat timeout* Then application state will be updated to *FAILED* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648758#comment-14648758 ] Hudson commented on YARN-433: - FAILURE: Integrated in Hadoop-trunk-Commit #8249 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8249/]) YARN-433. When RM is catching up with node updates then it should not expire acquired containers. Contributed by Xuan Gong (zxu: rev ab80e277039a586f6d6259b2511ac413e29ea4f8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, YARN-433.4.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-433: --- Fix Version/s: 2.8.0 When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, YARN-433.4.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648761#comment-14648761 ] zhihai xu commented on YARN-433: Yes, thanks [~xgong]! I committed this to trunk and branch-2. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, YARN-433.4.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648785#comment-14648785 ] Jian He commented on YARN-3983: --- - I suggest not setting the state implicitly in the constructor. It’s quite confusing which constructor indicates which state. Setting it explicitly at the caller makes the code easier to read. {code} public ContainerAllocation(RMContainer containerToBeUnreserved) { this(containerToBeUnreserved, null, AllocationState.QUEUE_SKIPPED); } public ContainerAllocation(RMContainer containerToBeUnreserved, Resource resourceToBeAllocated) { this(containerToBeUnreserved, resourceToBeAllocated, AllocationState.SUCCEEDED); } {code} - a reserved container returns the SUCCEEDED state? {code} ContainerAllocation result = new ContainerAllocation(null, request.getCapability()); result.reserved = true; result.containerNodeType = type; {code} Earlier, the below code would not be invoked for a reserved container; now it gets invoked {code} if (allocationResult.state == AllocationState.SUCCEEDED) { // Don't reset scheduling opportunities for offswitch assignments // otherwise the app will be delayed for each non-local assignment. // This helps apps with many off-cluster requests schedule faster. if (allocationResult.containerNodeType != NodeType.OFF_SWITCH) { if (LOG.isDebugEnabled()) { LOG.debug("Resetting scheduling opportunities"); } application.resetSchedulingOpportunities(priority); } // Non-exclusive scheduling opportunity is different: we need reset // it every time to make sure non-labeled resource request will be // most likely allocated on non-labeled nodes first. application.resetMissedNonPartitionedRequestSchedulingOpportunity(priority); } {code} - AllocationState#SUCCEEDED -> ALLOCATED. - the reserved boolean flag can be changed to an AllocationState#RESERVED state. - CSAssignment#NULL_ASSIGNMENT is not used, remove it. - the comment does not match the method name: {code} * doAllocation needs to handle following stuffs: {code} - Below code was originally outside of the priorities loop: {code} if (SchedulerAppUtils.isBlacklisted(application, node, LOG)) { return ContainerAllocation.APP_SKIPPED; } {code} - below code can be changed to use a null check? {code} if (Resources.greaterThan(rc, clusterResource, assigned.getResourceToBeAllocated(), Resources.none())) { {code} - move the for loop into the {{applicationContainerAllocator.allocate}} method {code} for (Priority priority : getPriorities()) { {code} - returning ContainerAllocation.QUEUE_SKIPPED, though logically correct, is semantically incorrect; it should return Priority_Skipped {code} private ContainerAllocation assignNodeLocalContainers( Resource clusterResource, ResourceRequest nodeLocalResourceRequest, FiCaSchedulerNode node, Priority priority, RMContainer reservedContainer, SchedulingMode schedulingMode, ResourceLimits currentResoureLimits) { if (canAssign(priority, node, NodeType.NODE_LOCAL, reservedContainer)) { return assignContainer(clusterResource, node, priority, nodeLocalResourceRequest, NodeType.NODE_LOCAL, reservedContainer, schedulingMode, currentResoureLimits); } return ContainerAllocation.QUEUE_SKIPPED; } // check if the resource request can access the label if (!SchedulerUtils.checkResourceRequestMatchingNodePartition(request, node.getPartition(), schedulingMode)) { // this is a reserved container, but we cannot allocate it now according // to label not match. This can be caused by node label changed // We should un-reserve this container.
return new ContainerAllocation(rmContainer); } {code} - APP_SKIPPED? original code seems skipping the priority {code} // Does the application need this resource? if (allocatedContainer == null) { // Skip this app if we failed to allocate. ContainerAllocation ret = new ContainerAllocation(allocationResult.containerToBeUnreserved); ret.state = AllocationState.APP_SKIPPED; return ret; } {code} Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch, YARN-3983.2.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there's a lot of differences between increasing a container and allocating a
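A minimal sketch of the explicit-state construction suggested in the first review point above (types here are simplified placeholders, not the actual patch code):
{code}
// Illustrative sketch only; RMContainer/Resource are replaced by placeholders.
public class ContainerAllocationSketch {
  enum AllocationState { ALLOCATED, RESERVED, APP_SKIPPED, PRIORITY_SKIPPED, QUEUE_SKIPPED }

  final Object containerToBeUnreserved;   // placeholder for RMContainer
  final Object resourceToBeAllocated;     // placeholder for Resource
  final AllocationState state;

  // A single constructor; callers name the state explicitly, so the intent is
  // visible at every call site instead of being implied by an overload.
  ContainerAllocationSketch(Object containerToBeUnreserved,
      Object resourceToBeAllocated, AllocationState state) {
    this.containerToBeUnreserved = containerToBeUnreserved;
    this.resourceToBeAllocated = resourceToBeAllocated;
    this.state = state;
  }
}
{code}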
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648786#comment-14648786 ] Xuan Gong commented on YARN-3999: - Can we move {code} transitionToStandby(false); {code} before {code} super.serviceStop(); {code}? In that case, when we shut down the RM, we would transition the RM to standby first (stopping all the active services) and then stop all the always-on services. Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3906: -- Attachment: YARN-3906-YARN-2928.001.patch v.1 patch posted. The application table is nearly identical to the entity table, except that some redundant information is omitted (e.g. entity type and entity id). The unit tests probably could be refactored a little more, but I wanted to get it reviewed. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648151#comment-14648151 ] Hadoop QA commented on YARN-3978: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 14s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 40s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 41s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 19s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 57s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 52m 15s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 97m 1s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748033/YARN-3978.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 91b42e7 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8719/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8719/console | This message was automatically generated. 
Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
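For illustration, a minimal sketch of how such a switch could be consumed by the history writer; the property name and helper class below are assumptions for this example, not necessarily the key added by the patch.
{code}
// Sketch only: a hypothetical flag that disables storing non-AM container
// metadata in the generic application history store.
import org.apache.hadoop.conf.Configuration;

public class HistoryStoreSaveSwitchSketch {
  // Hypothetical property name used purely for illustration.
  static final String SAVE_NON_AM_CONTAINER_META_INFO =
      "yarn.timeline-service.generic-application-history.save-non-am-container-meta-info";

  public static boolean shouldSaveContainerInfo(Configuration conf,
      boolean isAmContainer) {
    // AM container metadata is always kept; other containers are kept only
    // when the flag is on (defaulting to true preserves current behaviour).
    return isAmContainer
        || conf.getBoolean(SAVE_NON_AM_CONTAINER_META_INFO, true);
  }
}
{code}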
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647275#comment-14647275 ] zhangyubiao commented on YARN-3979: --- I found that the CPU usage and load were high because we use crontab to copy the RM logs. Today we stopped the copy, and the CPU usage and load returned to normal. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647303#comment-14647303 ] zhangyubiao commented on YARN-3979: --- I sent you the RM logs just now. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647319#comment-14647319 ] Varun Saxena commented on YARN-3232: I mean the RM can change these states while returning the application report. Some application states are not necessarily exposed to users Key: YARN-3232 URL: https://issues.apache.org/jira/browse/YARN-3232 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Varun Saxena application NEW_SAVING and SUBMITTED states are not necessarily exposed to users, as they are mostly internal to the system, transient, and not user-facing. We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647315#comment-14647315 ] Varun Saxena commented on YARN-3232: [~jianhe], what about the CLI? I guess we should not show these states there either. The RM can internally change the NEW_SAVING and SUBMITTED states to NEW. Some application states are not necessarily exposed to users Key: YARN-3232 URL: https://issues.apache.org/jira/browse/YARN-3232 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Varun Saxena application NEW_SAVING and SUBMITTED states are not necessarily exposed to users, as they are mostly internal to the system, transient, and not user-facing. We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
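To make the suggestion concrete, a minimal sketch of the kind of mapping the RM could apply before the state reaches the CLI or web UI; the helper class name is hypothetical.
{code}
// Sketch only: masking internal, transient states when exposing them to clients.
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public final class AppStateMaskerSketch {
  private AppStateMaskerSketch() {}

  // Map internal-only states to NEW before they reach the CLI / web UI;
  // all other states pass through unchanged.
  public static YarnApplicationState mask(YarnApplicationState state) {
    switch (state) {
      case NEW_SAVING:
      case SUBMITTED:
        return YarnApplicationState.NEW;
      default:
        return state;
    }
  }
}
{code}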
[jira] [Updated] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3887: -- Attachment: 0003-YARN-3887.patch Thank you [~jianhe] and [~rohithsharma] for the comments. Uploading a new patch addressing the comments. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch, 0003-YARN-3887.patch After YARN-2003, this adds support to change the priority of an application after submission. This ticket will handle the server-side implementation for the same. A new RMAppEvent will be created to handle this, and it will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
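As a rough sketch of the "new RMAppEvent" mentioned in the description, something along these lines could carry the requested priority through the dispatcher; the class name is an assumption, and the event type constant is left to the caller rather than invented here.
{code}
// Sketch only: an RMAppEvent subtype carrying the requested priority, so any
// scheduler can react to one common event. Not the committed implementation.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType;

public class RMAppPriorityChangeEventSketch extends RMAppEvent {
  private final Priority newPriority;

  public RMAppPriorityChangeEventSketch(ApplicationId appId,
      RMAppEventType type, Priority newPriority) {
    // The real patch would likely add a dedicated RMAppEventType constant;
    // here the type is passed in so this sketch assumes nothing new.
    super(appId, type);
    this.newPriority = newPriority;
  }

  public Priority getNewPriority() {
    return newPriority;
  }
}
{code}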
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647920#comment-14647920 ] Wangda Tan commented on YARN-3971: -- Looks good, committing... will add some comment to the change before commit. Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery -- Key: YARN-3971 URL: https://issues.apache.org/jira/browse/YARN-3971 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch Steps to reproduce # Create label x,y # Delete label x,y # Create label x,y add capacity scheduler xml for labels x and y too # Restart RM Both RM will become Standby. Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} {code} 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332)
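A minimal sketch of the general pattern (a recovery flag guarding the queue check); field and method names are illustrative and not taken from the attached patch.
{code}
// Sketch only: skip the queue validation while the label store is replaying
// its edit log, since capacity-scheduler.xml may legitimately reference
// labels that were deleted and re-added in the store history.
import java.io.IOException;
import java.util.Collection;

public abstract class NodeLabelsRecoverySketch {
  private volatile boolean inRecovery = false;

  protected abstract void checkRemoveFromClusterNodeLabelsOfQueue(
      Collection<String> labels) throws IOException;

  protected abstract void internalRemove(Collection<String> labels)
      throws IOException;

  public void setRecoveryMode(boolean flag) {
    inRecovery = flag;
  }

  public void removeFromClusterNodeLabels(Collection<String> labels)
      throws IOException {
    // Only enforce the queue check for live admin operations, not recovery.
    if (!inRecovery) {
      checkRemoveFromClusterNodeLabelsOfQueue(labels);
    }
    internalRemove(labels);
  }
}
{code}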
[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647873#comment-14647873 ] Jason Lowe commented on YARN-3942: -- I think so. We can probably create a new TimelineClient that stores to HDFS files based on how yarn.timeline-service.entity-file-store.summary-entity-types is configured. However, I'm not sure whether YARN can automatically replace the timeline clients being requested with this one, as the client needs to know the application ID when putting domains and the application attempt ID when posting entities. So one approach is to have YARN provide something like a TimelineEntityFileClient, which is a TimelineClient, but Tez and other app frameworks would have to explicitly ask for it themselves and provide the appropriate application ID/app attempt ID upon construction of the client. Let me know if that sounds OK, or if there's an idea of how YARN can seamlessly provide this alternative client instead of TimelineClientImpl when TimelineClient.createTimelineClient is called. Timeline store to read events from HDFS --- Key: YARN-3942 URL: https://issues.apache.org/jira/browse/YARN-3942 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3942.001.patch This adds a new timeline store plugin that is intended as a stop-gap measure to mitigate some of the issues we've seen with ATS v1 while waiting for ATS v2. The intent of this plugin is to provide a workable solution for running the Tez UI against the timeline server on large-scale clusters running many thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
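Under the assumptions above, a minimal sketch of what explicit construction with the attempt ID might look like for a framework such as Tez; the class name and directory layout are illustrative, not the committed API.
{code}
// Sketch only: an HDFS-backed timeline client that is constructed with the
// attempt ID, since a generic factory like TimelineClient.createTimelineClient()
// has no way to discover it.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

public class TimelineEntityFileClientSketch {
  private final FileSystem fs;
  private final Path attemptDir;

  public TimelineEntityFileClientSketch(Configuration conf,
      ApplicationAttemptId attemptId, Path baseDir) throws IOException {
    this.fs = FileSystem.get(conf);
    // One directory per application, one sub-directory per attempt.
    this.attemptDir = new Path(
        new Path(baseDir, attemptId.getApplicationId().toString()),
        attemptId.toString());
    fs.mkdirs(attemptDir);
  }

  public Path getAttemptDir() {
    return attemptDir;
  }
}
{code}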
[jira] [Updated] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers
[ https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3997: --- Assignee: Arun Suresh Target Version/s: 2.8.0 Priority: Critical (was: Major) An Application requesting multiple core containers can't preempt running application made of single core containers --- Key: YARN-3997 URL: https://issues.apache.org/jira/browse/YARN-3997 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.1 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines Reporter: Dan Shechter Assignee: Arun Suresh Priority: Critical When our cluster is configured with preemption, and is fully loaded with an application consuming 1-core containers, it will not kill off these containers when a new application kicks in requesting, for example, 4-core containers. When the second application attempts to use 1-core containers as well, preemption proceeds as planned and everything works properly. It is my assumption that the fair scheduler, while recognizing it needs to kill off some containers to make room for the new application, fails to find a SINGLE container satisfying the request for a 4-core container (since all existing containers are 1-core containers), and isn't smart enough to realize it needs to kill off 4 single-core containers (in this case) on a single node for the new application to be able to proceed... The exhibited effect is that the new application is hung indefinitely and never gets the resources it requires. This can easily be replicated with any YARN application. Our go-to scenario in this case is running pyspark with 1-core executors (containers) while trying to launch the h2o.ai framework, which INSISTS on having at least 4 cores per container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
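To make the reported gap concrete, a minimal sketch of the per-node aggregation the scheduler would need when choosing preemption victims: instead of looking for one running container at least as large as the pending ask, collect enough smaller containers on one node so that their combined resources cover it. Class and method names are illustrative, not FairScheduler code.
{code}
// Sketch only: pick enough small containers on a single node to cover a
// larger pending request (e.g. four 1-core containers for one 4-core ask).
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;

public class NodeLocalPreemptionSketch {
  public static List<RMContainer> pickVictims(List<RMContainer> runningOnNode,
      Resource pendingAsk) {
    List<RMContainer> victims = new ArrayList<RMContainer>();
    long mem = 0;
    int vcores = 0;
    for (RMContainer c : runningOnNode) {
      victims.add(c);
      mem += c.getContainer().getResource().getMemory();
      vcores += c.getContainer().getResource().getVirtualCores();
      // Stop as soon as the freed capacity on this node covers the request.
      if (mem >= pendingAsk.getMemory()
          && vcores >= pendingAsk.getVirtualCores()) {
        return victims;
      }
    }
    // This node alone cannot satisfy the ask; preempt nothing here.
    return new ArrayList<RMContainer>();
  }
}
{code}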
[jira] [Commented] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647916#comment-14647916 ] Hadoop QA commented on YARN-1643: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 14s | Findbugs (version ) appears to be broken on YARN-1197. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 20s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 14s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 42m 42s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748018/YARN-1643-YARN-1197.7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-1197 / cb95662 | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8716/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8716/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8716/console | This message was automatically generated. Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643-YARN-1197.7.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647926#comment-14647926 ] MENG DING commented on YARN-1643: - The failed test does not seem to be related. Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643-YARN-1197.7.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3971: - Attachment: 0005-YARN-3971.patch Attached latest patch committed to trunk. Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery -- Key: YARN-3971 URL: https://issues.apache.org/jira/browse/YARN-3971 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Fix For: 2.8.0 Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.patch Steps to reproduce # Create label x,y # Delete label x,y # Create label x,y add capacity scheduler xml for labels x and y too # Restart RM Both RM will become Standby. Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} {code} 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332)
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647957#comment-14647957 ] Jian He commented on YARN-3232: --- agree Some application states are not necessarily exposed to users Key: YARN-3232 URL: https://issues.apache.org/jira/browse/YARN-3232 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Varun Saxena application NEW_SAVING and SUBMITTED states are not necessarily exposed to users, as they are mostly internal to the system, transient, and not user-facing. We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission
[ https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647955#comment-14647955 ] Wangda Tan commented on YARN-3940: -- [~bibinchundatt], I took a look at the patch; it checks the app's node-label-expression AND the queue's accessible-node-labels, which is not sufficient in my opinion. We should check usage, as I mentioned at: https://issues.apache.org/jira/browse/YARN-3940?focusedCommentId=14633876page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14633876. Actually, I don't have a clear idea about how to solve this problem either. Another related problem is how to deal with node label updates: currently, if we change the labels on a node, all containers running on the node will be killed. I suggest we think carefully about both of these problems before moving forward. Application moveToQueue should check NodeLabel permission -- Key: YARN-3940 URL: https://issues.apache.org/jira/browse/YARN-3940 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch Configure capacity scheduler Configure node label and submit application {{queue=A Label=X}} Move application to queue {{B}} where label x is not accessible {code} 2015-07-20 19:46:19,626 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1437385548409_0005_01 released container container_e08_1437385548409_0005_01_02 on node: host: host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 used=memory:512, vCores:1 with event: KILL 2015-07-20 19:46:20,970 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1437385548409_0005_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, queue=b1 doesn't have permission to access all labels in resource request. labelExpression of resource request=x. 
Queue labels=y at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) {code} Same exception will be thrown till *heartbeat timeout* Then application state will be updated to *FAILED* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648057#comment-14648057 ] Jason Lowe commented on YARN-3942: -- The logs are created per app attempt. This helps avoid the split-brain, double-writer issue where the previous attempt is still running when the RM expires it (e.g. due to a network cut) and decides to launch another. The files are stored and looked up in a directory named after the application ID, and the entity files within that directory are stored based on the application attempt ID. I don't think using the app attempt ID is crucial for the latter, and the reader does not rely on the attempt ID from those files, but it was a simple way to avoid colliding with previous attempts and to have the reader process the files in attempt order. Timeline store to read events from HDFS --- Key: YARN-3942 URL: https://issues.apache.org/jira/browse/YARN-3942 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3942.001.patch This adds a new timeline store plugin that is intended as a stop-gap measure to mitigate some of the issues we've seen with ATS v1 while waiting for ATS v2. The intent of this plugin is to provide a workable solution for running the Tez UI against the timeline server on large-scale clusters running many thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
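A minimal sketch of the reader side under that layout (one directory per application ID, entity files keyed by attempt ID); the helper is illustrative and assumes that lexicographic ordering of the file names matches attempt order, which holds for standard zero-padded attempt IDs of the same application.
{code}
// Sketch only: list per-attempt entity files under an application directory
// and visit them in attempt order.
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AttemptOrderedReaderSketch {
  public static FileStatus[] listInAttemptOrder(FileSystem fs, Path appDir)
      throws IOException {
    FileStatus[] files = fs.listStatus(appDir);
    // Attempt IDs carry a zero-padded, monotonically increasing attempt
    // number, so sorting names lexicographically yields attempt order.
    Arrays.sort(files, new Comparator<FileStatus>() {
      @Override
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });
    return files;
  }
}
{code}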
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648089#comment-14648089 ] Hadoop QA commented on YARN-3887: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 58s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 10m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 25s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 31s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 36s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 7s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 43s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 38s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 53m 4s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 100m 19s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748024/0003-YARN-3887.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8acb30b | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8717/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8717/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8717/console | This message was automatically generated. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch, 0003-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1643: Attachment: YARN-1643-YARN-1197.7.patch Attaching the latest patch, which addresses the following:
* Make {{trackingContainers}} a ConcurrentHashMap, and remove {{containersToBeRemoved}}, {{containersToBeAdded}}, and the corresponding logic. Containers are directly added to/removed from/updated in {{trackingContainers}} when the corresponding events are received.
* Synchronize getters and setters in {{ProcessTreeInfo}} with regard to the vmemLimit/pmemLimit/cpuVcores fields.
* The previous patch didn't handle container metrics updates for container resize; add that and extract the container metrics logic into a common function.
* Add relevant test cases.
Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643-YARN-1197.7.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
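A condensed sketch of the concurrency pattern described above: the monitor thread iterates the map directly while event handlers add, remove, or resize entries, so the map is a ConcurrentHashMap and the mutable limits sit behind synchronized accessors. The class is illustrative and far smaller than the real ContainersMonitorImpl.
{code}
// Sketch only: concurrent tracking map plus synchronized per-container limits.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class ContainersMonitorSketch {
  static class ProcessTreeInfo {
    private long vmemLimit;
    private long pmemLimit;
    private int cpuVcores;

    synchronized void setResourceLimit(long vmem, long pmem, int vcores) {
      this.vmemLimit = vmem;
      this.pmemLimit = pmem;
      this.cpuVcores = vcores;
    }

    synchronized long getVmemLimit() { return vmemLimit; }
    synchronized long getPmemLimit() { return pmemLimit; }
    synchronized int getCpuVcores() { return cpuVcores; }
  }

  // Updated directly from start/stop/resize events; no staging collections
  // such as containersToBeAdded/containersToBeRemoved are needed.
  private final Map<ContainerId, ProcessTreeInfo> trackingContainers =
      new ConcurrentHashMap<ContainerId, ProcessTreeInfo>();

  public void onResourceChanged(ContainerId id, long vmem, long pmem, int vcores) {
    ProcessTreeInfo info = trackingContainers.get(id);
    if (info != null) {
      info.setResourceLimit(vmem, pmem, vcores);
    }
  }
}
{code}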
[jira] [Updated] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3990: --- Attachment: 0003-YARN-3990.patch [~jlowe] Thank you for your comments. {quote} java and release audit warnings. {quote} Done {quote} The new TestNMUpdateEvent file probably should just be named TestNodesListManager {quote} Done AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch, 0003-YARN-3990.patch Whenever a node is added or removed, NodesListManager sends an RMAppNodeUpdateEvent to all the applications in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events:
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
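A minimal sketch of such a check; the helper name and the exact set of final states are assumptions for illustration, and the real patch may gate the loop differently.
{code}
// Sketch only: filter out completed applications before sending
// RMAppNodeUpdateEvent, so node flapping does not flood the AsyncDispatcher.
import java.util.EnumSet;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;

public final class NodeUpdateFilterSketch {
  private static final EnumSet<RMAppState> COMPLETED_STATES =
      EnumSet.of(RMAppState.FINISHED, RMAppState.FAILED, RMAppState.KILLED);

  private NodeUpdateFilterSketch() {}

  // Only applications that have not reached a final state need to hear
  // about node usability changes.
  public static boolean shouldNotify(RMApp app) {
    return !COMPLETED_STATES.contains(app.getState());
  }
}
{code}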
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647949#comment-14647949 ] Bibin A Chundatt commented on YARN-3971: Thanks [~leftnoteasy] for review and committing patch Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery -- Key: YARN-3971 URL: https://issues.apache.org/jira/browse/YARN-3971 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Fix For: 2.8.0 Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.patch Steps to reproduce # Create label x,y # Delete label x,y # Create label x,y add capacity scheduler xml for labels x and y too # Restart RM Both RM will become Standby. Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} {code} 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} -- This 
message was sent by Atlassian JIRA (v6.3.4#6332)