[jira] [Created] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers
Dan Shechter created YARN-3997: -- Summary: An Application requesting multiple core containers can't preempt running application made of single core containers Key: YARN-3997 URL: https://issues.apache.org/jira/browse/YARN-3997 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.1 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines Reporter: Dan Shechter When our cluster is configured with preemption, and is fully loaded with an application consuming 1-core containers, it will not kill off these containers when a new application kicks in requesting, for example, 4-core containers. When the second application attempts to use 1-core containers as well, preemption proceeds as planned and everything works properly. It is my assumption that the fair scheduler, while recognizing it needs to kill off some container to make room for the new application, fails to find a SINGLE container satisfying the request for a 4-core container (since all existing containers are 1-core containers), and isn't smart enough to realize it needs to kill off 4 single-core containers (in this case) on a single node for the new application to be able to proceed... The exhibited effect is that the new application hangs indefinitely and never gets the resources it requires. This can easily be replicated with any YARN application. Our go-to scenario in this case is running pyspark with 1-core executors (containers) while trying to launch the h2o.ai framework, which INSISTS on having at least 4 cores per container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
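The behaviour described above suggests the preemption logic would need to aggregate several small preemptable containers on one node rather than look for a single container at least as large as the request. The following is a minimal, self-contained sketch of that aggregation using the public Resource/Resources utilities; it illustrates the idea only, is not FairScheduler code, and all names are made up for the example.
{code}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class PreemptionAggregationSketch {
  /**
   * Returns how many of the given preemptable containers on one node would have
   * to be killed before the requested capability fits, or -1 if even killing
   * all of them is not enough.
   */
  static int containersToPreempt(Resource requested, Resource freeOnNode,
      List<Resource> preemptableOnNode) {
    Resource available = Resources.clone(freeOnNode);
    if (Resources.fitsIn(requested, available)) {
      return 0; // nothing needs to be preempted
    }
    int killed = 0;
    for (Resource c : preemptableOnNode) {
      Resources.addTo(available, c);
      killed++;
      if (Resources.fitsIn(requested, available)) {
        return killed; // e.g. four 1-core containers freed for a 4-core ask
      }
    }
    return -1; // the request can never fit on this node
  }

  public static void main(String[] args) {
    Resource fourCoreAsk = Resource.newInstance(4096, 4);
    List<Resource> running = Arrays.asList(
        Resource.newInstance(1024, 1), Resource.newInstance(1024, 1),
        Resource.newInstance(1024, 1), Resource.newInstance(1024, 1));
    // Prints 4: all four 1-core containers must go before the 4-core ask fits.
    System.out.println(containersToPreempt(fourCoreAsk,
        Resource.newInstance(0, 0), running));
  }
}
{code}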
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647507#comment-14647507 ] Rohith Sharma K S commented on YARN-3250: - bq. I think one problem is that if there's ever a value set in state-store, RM cannot pick up the value using the config any more I see, I agree. Configuration files would become stale after one restart/switch. How about having a command that reads the relevant configuration from yarn-site.xml, very much like {{./yarn rmadmin refreshAdminAcls}}? That command reads *yarn.admin.acl* from the yarn-site.xml configuration when refreshAdminAcls is invoked. Along the same lines, setting cluster-max-application-priority would be {{./yarn rmadmin refreshClusterMaxPriority}} or {{./yarn rmadmin refreshClusterPriority}}. Thoughts? bq. How about yarn application ApplicationId -setPriority priority ? Makes sense. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S The current Application Priority Manager supports configuration only via file. To support runtime configuration through the admin CLI and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
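If the refresh-style command is adopted, its handler would presumably do little more than re-read the value from a freshly loaded yarn-site.xml, the same way refreshAdminAcls re-reads *yarn.admin.acl*. A rough sketch of that idea follows; the property name and method are assumptions made for illustration, not the committed API.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RefreshClusterMaxPrioritySketch {
  // Hypothetical property name, used only for illustration.
  static final String CLUSTER_MAX_APP_PRIORITY = "yarn.cluster.max-application-priority";

  /** Re-reads the cluster max priority from yarn-site.xml, as a refresh command might. */
  static int refreshClusterMaxPriority() {
    // new YarnConfiguration() reloads yarn-default.xml and yarn-site.xml from the classpath
    Configuration conf = new YarnConfiguration();
    return conf.getInt(CLUSTER_MAX_APP_PRIORITY, 0);
  }

  public static void main(String[] args) {
    System.out.println("cluster max application priority = " + refreshClusterMaxPriority());
  }
}
{code}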
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647580#comment-14647580 ] Hudson commented on YARN-3950: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #261 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/261/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647582#comment-14647582 ] Hudson commented on YARN-3919: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #261 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/261/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
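The fix being suggested is essentially a defensive null guard in the stop path, since serviceStop can run after serviceStart failed partway through initialization. Below is a sketch of the pattern; the field names are assumptions for illustration and do not necessarily match FileSystemNodeLabelsStore.
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;

public class NullSafeCloseSketch {
  private FileSystem fs;                // may still be null if start() failed early
  private FSDataOutputStream editLog;   // likewise

  /** Closes only what was actually opened, so a partially started service stops cleanly. */
  public void close() throws IOException {
    IOException first = null;
    if (editLog != null) {
      try { editLog.close(); } catch (IOException e) { first = e; }
    }
    if (fs != null) {
      try { fs.close(); } catch (IOException e) { if (first == null) { first = e; } }
    }
    if (first != null) {
      throw first;
    }
  }
}
{code}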
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647578#comment-14647578 ] Hudson commented on YARN-2768: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #261 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/261/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/CHANGES.txt Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of the profiling result. The clone of the Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) of the CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand:
{code}
public void updateDemand() {
  demand = Resources.createResource(0);
  // Demand is current consumption plus outstanding requests
  Resources.addTo(demand, app.getCurrentConsumption());
  // Add up outstanding resource requests
  synchronized (app) {
    for (Priority p : app.getPriorities()) {
      for (ResourceRequest r : app.getResourceRequests(p).values()) {
        Resource total = Resources.multiply(r.getCapability(), r.getNumContainers());
        Resources.addTo(demand, total);
      }
    }
  }
}
{code}
The code of Resources.multiply:
{code}
public static Resource multiply(Resource lhs, double by) {
  return multiplyTo(clone(lhs), by);
}
{code}
The clone could be skipped by directly updating the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
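One way to realize the suggested optimization is an in-place multiply-and-add helper, so that updateDemand never allocates a temporary Resource per ResourceRequest. The sketch below illustrates the idea; the helper name and arithmetic are assumptions, not necessarily what the committed patch does.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public final class ResourcesSketch {
  /** In-place equivalent of addTo(lhs, multiply(rhs, by)) that avoids cloning rhs. */
  public static Resource multiplyAndAddTo(Resource lhs, Resource rhs, double by) {
    lhs.setMemory(lhs.getMemory() + (int) (rhs.getMemory() * by));
    lhs.setVirtualCores(lhs.getVirtualCores() + (int) (rhs.getVirtualCores() * by));
    return lhs;
  }
  // In FSAppAttempt#updateDemand the inner loop could then become something like:
  //   Resources.multiplyAndAddTo(demand, r.getCapability(), r.getNumContainers());
  // so no temporary Resource is created for each ResourceRequest.
}
{code}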
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647585#comment-14647585 ] mujunchao commented on YARN-3857: - Thanks Devar for reviewing. 1. I think ClientToAMTokenSecretManagerInRM#hasMasterKey() is necessary; in this case the value is null, so I added a new function to recognize it. 2. Fixed. 3. As I use the @VisibleForTesting annotation, my IDE needs to import com.google.common.annotations.VisibleForTesting. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Assignee: mujunchao Priority: Critical Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair (ApplicationAttemptId, null), but we never remove it from the HashMap, as unregistration only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
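For reference, the hasMasterKey() helper mentioned above would presumably be a null-aware lookup on the secret manager's key map, letting callers distinguish 'registered with a null key (SIMPLE mode)' from 'not registered at all', and unregistering unconditionally avoids the leak. A sketch of both ideas, with the map name assumed for illustration:
{code}
import java.util.HashMap;
import java.util.Map;

import javax.crypto.SecretKey;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

public class ClientToAMTokenKeySketch {
  // Illustrative stand-in for the map kept by ClientToAMTokenSecretManagerInRM.
  private final Map<ApplicationAttemptId, SecretKey> masterKeys =
      new HashMap<ApplicationAttemptId, SecretKey>();

  /** True only when the attempt is registered AND has a real (non-null) master key. */
  public synchronized boolean hasMasterKey(ApplicationAttemptId attemptId) {
    return masterKeys.get(attemptId) != null;
  }

  /** Removing the entry even in SIMPLE mode prevents the HashMap from growing forever. */
  public synchronized void unregisterApplication(ApplicationAttemptId attemptId) {
    masterKeys.remove(attemptId);
  }
}
{code}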
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mujunchao updated YARN-3857: Attachment: YARN-3857-3.patch Fixed the review comments and formatted the code. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Assignee: mujunchao Priority: Critical Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair (ApplicationAttemptId, null), but we never remove it from the HashMap, as unregistration only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647503#comment-14647503 ] Hadoop QA commented on YARN-3990: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 12s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:red}-1{color} | javac | 7m 37s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 20s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 52m 38s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 90m 31s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747961/0002-YARN-3990.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ddc867ce | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/8714/artifact/patchprocess/diffJavacWarnings.txt | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/8714/artifact/patchprocess/patchReleaseAuditProblems.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8714/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8714/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8714/console | This message was automatically generated. AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. 
An additional check for whether the app is finished/killed/failed would minimize the unnecessary events
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
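The proposed check only needs to filter out applications already in a terminal state before queuing the RMAppNodeUpdateEvent. A minimal sketch of such a guard follows; the helper name is an assumption, and the actual patch may test the states differently.
{code}
import java.util.EnumSet;

import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;

public final class NodeUpdateFilterSketch {
  private static final EnumSet<RMAppState> TERMINAL_STATES =
      EnumSet.of(RMAppState.FINISHED, RMAppState.FAILED, RMAppState.KILLED);

  /** Skips apps that no longer care about node usable/unusable transitions. */
  static boolean shouldForwardNodeUpdate(RMApp app) {
    return !TERMINAL_STATES.contains(app.getState());
  }
  // Each loop body in NodesListManager#handle above could then start with:
  //   if (!shouldForwardNodeUpdate(app)) { continue; }
}
{code}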
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647497#comment-14647497 ] Hudson commented on YARN-2768: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #272 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/272/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647499#comment-14647499 ] Hudson commented on YARN-3950: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #272 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/272/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647501#comment-14647501 ] Hudson commented on YARN-3919: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #272 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/272/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647568#comment-14647568 ] mujunchao commented on YARN-3857: - Thanks for your review; I have fixed the comments and the indentation. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Assignee: mujunchao Priority: Critical Attachments: YARN-3857-1.patch, YARN-3857-2.patch, YARN-3857-3.patch, hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair (ApplicationAttemptId, null), but we never remove it from the HashMap, as unregistration only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
Jun Gong created YARN-3998: -- Summary: Add retry-times to let NM re-launch container when it fails to run Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches containers, it could specify the value. Then the NM will re-launch the container 'retry-times' times when it fails to run (e.g. the exit code is not 0). It will save a lot of time: it avoids container localization, the RM does not need to re-schedule the container, and local files in the container's working directory will be left for re-use (if the container has downloaded some big files, it does not need to re-download them when running again). We find it useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647824#comment-14647824 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2218 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2218/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/CHANGES.txt Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647826#comment-14647826 ] Hudson commented on YARN-3950: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2218 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2218/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java * hadoop-yarn-project/CHANGES.txt Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647828#comment-14647828 ] Hudson commented on YARN-3919: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2218 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2218/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647692#comment-14647692 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2199 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2199/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3965) Add startup timestamp for nodemanager
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-3965: -- Attachment: YARN-3965-3.patch Add startup timestamp for nodemanager Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs, and then finds it hard to check whether all NMs actually restarted. In practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM web service, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647743#comment-14647743 ] Jason Lowe commented on YARN-3990: -- Please look into the javac and release audit warnings. The new TestNMUpdateEvent file probably should just be named TestNodesListManager so we can add more tests specific to the NodesListManager there. Rather than using mock RM and NM objects and running some RMApps, it seems like we could have just isolated the NodesListManager more directly. We can hand it a mocked RMContext that returns a pre-baked list of apps that are alive and apps that are finished, and then verify, when we send a node usable/unusable event, that the appropriate update events are sent to the appropriate apps by registering our own event handler with something like a drain or inline dispatcher. Not a must-fix, just wondering if that approach was considered. AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647785#comment-14647785 ] Hudson commented on YARN-3950: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #269 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/269/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3965) Add startup timestamp for nodemanager
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647799#comment-14647799 ] Hadoop QA commented on YARN-3965: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 2s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 16s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 19s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 41s | The applied patch generated 2 new checkstyle issues (total was 46, now 48). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 24s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 21s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 14s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 46m 16s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747995/YARN-3965-3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ddc867ce | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8715/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8715/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8715/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8715/console | This message was automatically generated. Add startup timestamp for nodemanager Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs, and then finds it hard to check whether all NMs actually restarted. In practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM web service, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647694#comment-14647694 ] Hudson commented on YARN-3950: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2199 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2199/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647696#comment-14647696 ] Hudson commented on YARN-3919: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #2199 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2199/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647787#comment-14647787 ] Hudson commented on YARN-3919: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #269 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/269/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647783#comment-14647783 ] Hudson commented on YARN-2768: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #269 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/269/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3965) Add startup timestamp for nodemanager
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647709#comment-14647709 ] Hong Zhiguo commented on YARN-3965: --- Made it private with a getter. Hi [~zxu], [~jlowe], could you please review the patch? Add startup timestamp for nodemanager Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs, and it is hard to check whether all NMs actually restarted. In practice there are always some NMs that did not restart as expected, which later leads to errors due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM web service, find out which NMs did not restart, and take manual action. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
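A minimal sketch of the idea (field and getter names are my own, not necessarily those in the YARN-3965 patch): capture the start time once, keep the field private, and expose it through a getter so a web service or info bean can report it.
{code}
class NodeManagerStartupInfo {
  // Recorded once when the object is created, i.e. when the NM process starts.
  private final long nmStartupTime = System.currentTimeMillis();

  public long getNMStartupTime() {
    return nmStartupTime;
  }
}
{code}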
[jira] [Commented] (YARN-3998) Add retry-times to let NM re-launch container when it fails to run
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647755#comment-14647755 ] Jason Lowe commented on YARN-3998: -- Is this really a feature that YARN needs to provide? To me this is basically a case of container re-use which the application itself can control. A primitive example would be an application that launches a container that wraps the real task in a wrapper shell script or Java program that spawns the real task and will respawn it some number of times if the real task fails, before failing the entire container. I'm not sure YARN is the best place to put this functionality. Add retry-times to let NM re-launch container when it fails to run -- Key: YARN-3998 URL: https://issues.apache.org/jira/browse/YARN-3998 Project: Hadoop YARN Issue Type: New Feature Reporter: Jun Gong Assignee: Jun Gong I'd like to add a field (retry-times) in ContainerLaunchContext. When the AM launches containers, it could specify the value. Then the NM will re-launch the container 'retry-times' times when it fails to run (e.g. the exit code is not 0). This saves a lot of time: it avoids container localization, the RM does not need to re-schedule the container, and the local files in the container's working directory are left in place for re-use (if the container has downloaded some big files, it does not need to re-download them when running again). We find this useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
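A hedged sketch of the wrapper approach Jason Lowe describes (the command, script name, and retry count are illustrative assumptions, not part of any YARN API or of the proposal): the container's entry point is a small launcher that respawns the real task a fixed number of times before letting the container fail.
{code}
import java.util.Arrays;
import java.util.List;

public class RetryingLauncher {
  public static void main(String[] args) throws Exception {
    final int maxAttempts = 3;  // stand-in for the proposed 'retry-times' value
    final List<String> task = Arrays.asList("bash", "-c", "./run-real-task.sh");
    int exitCode = 1;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // Spawn the real task and wait; the container's working directory
      // (and anything already localized or downloaded) is reused across attempts.
      exitCode = new ProcessBuilder(task).inheritIO().start().waitFor();
      if (exitCode == 0) {
        break;
      }
      System.err.println("Attempt " + attempt + " failed with exit code " + exitCode);
    }
    System.exit(exitCode);  // the container only fails after all attempts fail
  }
}
{code}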
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648175#comment-14648175 ] Eric Payne commented on YARN-3978: -- {{checkstyle}} indicates that {{YarnConfiguration.java}} is too long. I will not be fixing that as part of this JIRA. Everything else from the build seems to be okay. [~jeagles], can you please have a look at this patch? Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648178#comment-14648178 ] Hadoop QA commented on YARN-3983: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 11s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 19s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 47s | The applied patch generated 30 new checkstyle issues (total was 54, now 55). | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 16 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 21s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 49m 29s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 87m 37s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA | | | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | | | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748038/YARN-3983.2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 91b42e7 | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8720/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8720/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8720/console | This message was automatically generated. 
Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch, YARN-3983.2.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend the existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there are a lot of differences between increasing a container and allocating a container: - Increasing a container doesn't need to check locality delay. - Increasing a container doesn't need to build/modify a resource request tree (ANY-RACK/HOST). - Increasing a container doesn't need to check allocation/reservation starvation (see {{shouldAllocOrReserveNewContainer}}). - After increasing a container is approved by the scheduler, it needs to update an existing container token instead of creating a new container. And there are lots of similarities when allocating different types of resources: - User-limit/queue-limit will be enforced for both of them. - Both of them need resource reservation logic (maybe continuous reservation looking is needed for both of them). The purpose of this JIRA is to make it easier to extend the CapacityScheduler resource allocation logic to support different types of resource allocation, to make common code reusable, and to improve code organization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648169#comment-14648169 ] Hadoop QA commented on YARN-3906: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 31s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 15s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 7s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 25s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 50s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 29s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 38m 30s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748043/YARN-3906-YARN-2928.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8721/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8721/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8721/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8721/console | This message was automatically generated. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kishore Chaliparambil updated YARN-2884: Attachment: YARN-2884-V7.patch Fixed the javadoc warnings. Proxying all AM-RM communications - Key: YARN-2884 URL: https://issues.apache.org/jira/browse/YARN-2884 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Carlo Curino Assignee: Kishore Chaliparambil Attachments: YARN-2884-V1.patch, YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch We introduce the notion of an RMProxy, running on each node (or once per rack). Upon start the AM is forced (via tokens and configuration) to direct all its requests to a new service running on the NM that provides a proxy to the central RM. This gives us a place to: 1) perform distributed scheduling decisions, 2) throttle misbehaving AMs, and 3) mask the access to a federation of RMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648218#comment-14648218 ] Zhijie Shen commented on YARN-3049: --- [~gtCarrera9], thanks for the review. I've addressed most of your comments in the new patch except the following: bq. However, I still incline to proceed the changes in this JIRA so that we can speed up consolidating our POC patches. Exactly. bq. Reader interface: use TimelineCollectorContext to package reader arguments? Yeah, I can see the rationale behind it, but maybe it's not TimelineCollectorContext. As I see a lot of arguments for the reader interface (as well as the writer one) and the potential signature change in future (e.g, adding newApp in this patch), I start to think of grouping the primitive arguments, shielding them in some category object, such as EntityContext, EntityFilters, Opts and so on, and using these as the arguments of the interface instead. Therefore, if we want to add newApp here, we don't really need to change the method signature, but add a getter/setter in Opts. Please let me know what you think about the idea. I can file another JIRA to deal with it. bq. We're now performing filters by ourselves in memory. I'm wondering if it will be more efficient to translate some of our filter specifications into HBase filters? That sounds like a good idea, which should potentially improve the read performance. Let me investigate how to map our filters onto HBase filters and push them to the backend. Given that it may be non-trivial work, can we get this patch in and follow up on the filter change in another JIRA, just in case? bq. Add a specific test in TestHBaseTimelineWriterImpl for App2FlowTable? In fact, it has been tested. I changed the write path by letting newApp = true, and checked whether we can query the entity successfully without giving the flow/flowRun explicitly. However, I didn't do much assertion around the fields of the retrieved entities, because I'm considering deferring that work until we rewrite the whole HBase backend unit tests. The current tests are too preliminary to catch potential bugs around DB operations. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
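An illustrative-only sketch of the "group the primitive arguments" idea discussed above (the names EntityContext, ReaderOptions, and TimelineReaderSketch are hypothetical, not an existing timeline service API): adding a knob such as newApp later would mean adding a field and getter/setter here rather than changing every reader method signature.
{code}
class EntityContext {
  String clusterId;
  String userId;
  String flowId;
  Long flowRunId;
  String appId;
  String entityType;
  String entityId;
}

class ReaderOptions {
  boolean fetchEvents;
  boolean fetchMetrics;
  boolean newApp;  // example of a later addition that needs no interface change
}

interface TimelineReaderSketch {
  Object getEntity(EntityContext context, ReaderOptions options);
}
{code}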
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648336#comment-14648336 ] Hadoop QA commented on YARN-3049: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 2s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 45s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 42s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 16s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 25s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 46s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 53m 2s | Tests passed in hadoop-yarn-server-resourcemanager. | | {color:green}+1{color} | yarn tests | 1m 24s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 97m 43s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748046/YARN-3049-YARN-2928.3.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8722/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8722/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8722/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8722/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8722/console | This message was automatically generated. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-3999: - Assignee: Jian He Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He If external systems like ATS or ZK become very slow, draining all the events takes a lot of time. If this time grows beyond 10 minutes, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
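A hedged sketch of the bounded drain described above (not the actual YARN-3999 patch; the eventQueue field and the timeout value are assumptions): wait for the queue to empty, but give up after a deadline so a slow downstream system (ATS, ZK) cannot block shutdown indefinitely.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class DrainWithTimeout {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();

  void drainEventsOnStop(long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (!eventQueue.isEmpty() && System.currentTimeMillis() < deadline) {
      Thread.sleep(100);  // poll until drained or the deadline passes
    }
    // Proceed with stop even if some events remain; better than expiring all apps.
  }
}
{code}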
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648243#comment-14648243 ] Li Lu commented on YARN-3049: - Hi [~zjshen]! Some of my comments: bq. As I see a lot of arguments for the reader interface (as well as the writer one) and the potential signature change in future (e.g, adding newApp in this patch), I start to think of grouping the primitive arguments, shielding them in some category object, such as EntityContext, EntityFilters, Opts and so on, and using these as the arguments of the interface instead. I agree. Actually I spent quite some time wondering if we really need to add the {{newApp}} argument in this patch. Encapsulating all related information into a category object appears to be a nice way to avoid future interface changes. +1. bq. Given it may be a non-trivial work, can we get this patch in and follow up the filter change in another jira just in case? Definitely. Let's consolidate the whole workflow first. Then we can start these improvements. bq. In fact, it has been tested. I change the write path by letting newApp = true, and check if we can query the entity successfully without giving the flow/flowRun explicitly. However, I didn't do much assertion around the fields of retrieved entities, because I consider of deferring this work together with rewriting the whole HBase backend unit test. Sounds good to me. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2923) Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648245#comment-14648245 ] Wangda Tan commented on YARN-2923: -- Thanks for updating, [~Naganarasimha]! Some comments: 1) All script-provider-related configurations/logic should be removed; they should come with a different patch. For example, {{case YarnConfiguration.SCRIPT_NODE_LABELS_PROVIDER:}} and {{public static final String SCRIPT_NODE_LABELS_PROVIDER = script;}} should be removed. 2) For the logic in NodeStatusUpdater, the things in my mind are: - PreviousNodeLabels will be reset every time we do a fetch (to avoid handling the same node labels as much as possible). - Don't reset the node labels if the fetched node labels are incorrect (this should be part of error handling; we should treat it as an error to be avoided instead of forcing a reset). - Don't do the check if the newly fetched node labels are the same as previousNodeLabels (also to avoid handling the same node labels). A little cosmetic suggestion: I found {{startStatusUpdater}} is too complex, full of try/catch, etc. I suggest making the label-related logic: a. fetch labels and check them; b. handle the response from the RM and post-process. Each of them should be a separate method to improve readability. I suggest keeping the provider within nodemanager (instead of yarn-server-common) for this patch; we can move it if we decide to do that in the future. Please let me know about your thoughts. Wangda Support configuration based NodeLabelsProvider Service in Distributed Node Label Configuration Setup - Key: YARN-2923 URL: https://issues.apache.org/jira/browse/YARN-2923 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2923.20141204-1.patch, YARN-2923.20141210-1.patch, YARN-2923.20150328-1.patch, YARN-2923.20150404-1.patch, YARN-2923.20150517-1.patch As part of the Distributed Node Labels configuration, we need to support node labels being configured in yarn-site.xml. On modification of the node labels configuration in yarn-site.xml, the NM should be able to get the modified node labels from this NodeLabelsProvider service without an NM restart. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
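An illustrative-only sketch of the readability suggestion in the comment above (names and types are placeholders, not the NodeStatusUpdater API): pull the label work out of the status-updater loop into two small methods, one that fetches and checks labels from the provider, and one that post-processes the RM's response.
{code}
import java.util.Collections;
import java.util.Set;

class LabelUpdateSketch {
  private Set<String> previousNodeLabels = Collections.emptySet();

  // a. Fetch labels and check; return null when there is nothing new to send.
  Set<String> fetchAndCheckLabels(Set<String> fetched) {
    if (fetched == null || fetched.equals(previousNodeLabels)) {
      return null;  // invalid fetch or unchanged labels: skip this update
    }
    return fetched;
  }

  // b. Handle the RM's response for the labels that were sent.
  void handleLabelUpdateResponse(Set<String> sent, boolean acceptedByRM) {
    if (acceptedByRM) {
      previousNodeLabels = sent;
    }
  }
}
{code}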
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648328#comment-14648328 ] Sangjin Lee commented on YARN-3906: --- Yes, that's a good point. I don't think the conflict will be that bad. We'll just see how these JIRAs go, and we'll adjust whichever JIRA that goes later. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3049: -- Attachment: YARN-3049-YARN-2928.3.patch [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch, YARN-3049-YARN-2928.3.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3906: -- Attachment: YARN-3906-YARN-2928.002.patch v.2 patch posted. Fixed the whitespace. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648382#comment-14648382 ] Hadoop QA commented on YARN-3906: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 39s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 16s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 6s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 26s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 41s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 47s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 38m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748070/YARN-3906-YARN-2928.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8724/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8724/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8724/console | This message was automatically generated. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648416#comment-14648416 ] Jason Lowe commented on YARN-4000: -- Example stacktrace: {noformat} 2015-07-30 22:12:03,424 ERROR [main] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(582)) - Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:792) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1320) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:128) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1075) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1032) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:890) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2100(RMAppImpl.java:109) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:938) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:895) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:761) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:323) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:433) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1157) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:577) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1667) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1041) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1185) 2015-07-30 22:12:03,425 INFO [main] service.AbstractService (AbstractService.java:noteFailure(272)) - Service RMActiveServices failed in state STARTED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:792) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1320) at
[jira] [Commented] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648457#comment-14648457 ] Zhijie Shen commented on YARN-3904: --- [~gtCarrera9], thanks for the patch. Below are my comments: bq. The two failed tests passed on my local machine, and the failures appeared to be irrelevant. This said, we may still need to fix those intermittent test failures. Do we plan to fix that in this patch? Some high-level comments: 1. As is also mentioned in YARN-3049, how about refactoring the reader/writer method signatures in a separate JIRA to avoid conflicts? 2. I suggest moving the table creation stuff into TimelineSchemaCreator. 3. As the HBase backend is accessed both directly and via Phoenix, it would be good to clean up the configuration to say we're using the HBase backend (as opposed to the FS backend) instead of specifically an HBase or Phoenix writer/reader. Other patch details: 1. Make OfflineAggregationWriter extend Service, such that you don't need to define init. 2. Now we're working towards a production-standard patch. Would you please write some javadoc to explain the schema of the aggregation tables, like what we did for the HBase tables? 3. The connection config should be moved to YarnConfiguration. 4. Why is the info column family kept? I expect the aggregation table to only have metrics data. 5. Let's also have a default PhoenixOfflineAggregationWriterImpl constructor to be used in the production code. 6. {{Class.forName(DRIVER_CLASS_NAME);}} doesn't need to be invoked every time we get a connection. Refactor timelineservice.storage to add support to online and offline aggregation writers - Key: YARN-3904 URL: https://issues.apache.org/jira/browse/YARN-3904 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch After we finished the design for time-based aggregation, we can adopt our existing Phoenix storage into the storage of the aggregated data. In this JIRA, I'm proposing to refactor the writers to add support for aggregation writers. Offline aggregation writers typically have less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
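A hedged sketch for comment 6 above (the class name, driver constant, and connection URL are placeholders, not the patch's values): load the JDBC driver class once in a static initializer instead of calling Class.forName on every connection request.
{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class AggregationStorageConnector {
  private static final String DRIVER_CLASS_NAME = "org.apache.phoenix.jdbc.PhoenixDriver";
  private static final String CONNECTION_URL = "jdbc:phoenix:localhost:2181";

  static {
    try {
      Class.forName(DRIVER_CLASS_NAME);   // one-time driver registration
    } catch (ClassNotFoundException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  Connection getConnection() throws SQLException {
    return DriverManager.getConnection(CONNECTION_URL);  // no Class.forName here
  }
}
{code}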
[jira] [Updated] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3906: -- Attachment: YARN-3906-YARN-2928.003.patch v.3 Forgot to add the application table to the schema creation. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648574#comment-14648574 ] Hadoop QA commented on YARN-3906: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 46s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 53s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 16s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 7s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 27s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 40s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 50s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 23s | Tests passed in hadoop-yarn-server-timelineservice. | | | | 38m 36s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748090/YARN-3906-YARN-2928.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8727/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8727/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8727/console | This message was automatically generated. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch, YARN-3906-YARN-2928.003.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648540#comment-14648540 ] Sangjin Lee commented on YARN-3984: --- To be clear, with the latter option, if we want to look for an event by id, we can use {{ColumnPrefixFilter}} for {{e! eventId}}, right? So in that case we won't need to fetch all columns, correct? Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
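For reference, a hedged sketch of how such a prefix filter could look with the HBase client API (the column family name "i" and the "e!" prefix/separator are illustrative, not the exact timeline schema): restrict the scan to the columns of one event id so the read does not fetch every event column.
{code}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

class EventColumnScanSketch {
  Scan scanForEvent(String eventId) {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("i"));  // assumed info column family
    // Only columns whose qualifier starts with "e!<eventId>" are returned.
    scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("e!" + eventId)));
    return scan;
  }
}
{code}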
[jira] [Commented] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648431#comment-14648431 ] Hadoop QA commented on YARN-2884: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 48s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 45s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 40s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 13s | The applied patch generated 1 new checkstyle issues (total was 237, now 237). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 23s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 6m 47s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 6m 11s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 52m 26s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 111m 1s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748065/YARN-2884-V7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8723/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8723/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8723/console | This message was automatically generated. Proxying all AM-RM communications - Key: YARN-2884 URL: https://issues.apache.org/jira/browse/YARN-2884 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Carlo Curino Assignee: Kishore Chaliparambil Attachments: YARN-2884-V1.patch, YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, YARN-2884-V6.patch, YARN-2884-V7.patch We introduce the notion of an RMProxy, running on each node (or once per rack). Upon start the AM is forced (via tokens and configuration) to direct all its requests to a new services running on the NM that provide a proxy to the central RM. This give us a place to: 1) perform distributed scheduling decisions 2) throttling mis-behaving AMs 3) mask the access to a federation of RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648488#comment-14648488 ] Arun Suresh commented on YARN-2005: --- Thanks for the patch, [~adhoot]. A couple of comments: # noBlacklist in DisabledBlacklistManager can be made static final. # {{getNumClusterHosts()}} in AbstractYarnScheduler: any reason we are creating a new set? I think returning this.nodes.size() should suffice, right? # Wouldn't removing from the shared blacklist cause problems if the shared blacklist already contained the blacklisted node? Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
Jason Lowe created YARN-4000: Summary: RM crashes with NPE if leaf queue becomes parent queue during restart Key: YARN-4000 URL: https://issues.apache.org/jira/browse/YARN-4000 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe This is a similar situation to YARN-2308. If an application is active in queue A and then the RM restarts with a changed capacity scheduler configuration where queue A becomes a parent queue to other subqueues then the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648521#comment-14648521 ] Li Lu commented on YARN-3984: - Thanks! I think I'm leaning towards eventid#inverse_event_timestamp?eventKey then, if we have to do the sorting in memory anyway. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3999: -- Attachment: YARN-3999.patch Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648516#comment-14648516 ] Vrushali C commented on YARN-3984: -- If the query has the exact timestamp as well as the event id, then we can. But for queries like "Give me information about CONTAINER KILLED events for this application", we won't be able to return this information without querying for all events of this application. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648359#comment-14648359 ] Li Lu commented on YARN-3906: - Hi [~sjlee0], I looked at the patch, and have one general question. It appears that the application table reuses most of the data schema of the entity table, with just some slight changes on its row keys. I've also noticed that the newly added Application*.java files overlap significantly with Entity*.java. While the current patch is totally fine in its core functions, I'm wondering if it is possible to reuse most of the code from the entity table. Ideally, we may want to build our *Table, *ColumnFamily, etc. on each new data schema, rather than on each new table? IIUC, two most significant differences between the entity table and the application table are table names and row key structures. Maybe we can change the Entity* classes to allow those differences? Or, am I missing any key points here? split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648363#comment-14648363 ] Li Lu commented on YARN-3906: - ...fixing formatting problems... bq. Hi Sangjin Lee, I looked at the patch, and have one general question. It appears that the application table reuses most of the data schema of the entity table, with just some slight changes on its row keys. I've also noticed that the newly added Application\*.java files overlap significantly with Entity\*.java. While the current patch is totally fine in its core functions, I'm wondering if it is possible to reuse most of the code from the entity table. Ideally, we may want to build our \*Table, \*ColumnFamily, etc. on each new data schema, rather than on each new table? IIUC, two most significant differences between the entity table and the application table are table names and row key structures. Maybe we can change the Entity classes to allow those differences? Or, am I missing any key points here? split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648472#comment-14648472 ] Sangjin Lee commented on YARN-3906: --- bq. I've also noticed that the newly added Application*.java files overlap significantly with Entity*.java. Thanks for bringing up that point [~gtCarrera9]. I should have added some explanation of why I wrote it this way. That is the first thing I noticed as I looked into adding the new table. \*Table and \*RowKey are not so bad, but \*ColumnFamily, \*Column, and \*ColumnPrefix definitely have a lot of overlapping code. That is largely an artifact of the design decision to use enums to implement these classes. Enums are nice because they let us seal the list of members cleanly, and the code that uses the API becomes very strongly typed. On the other hand, the downside is that enums cannot be extended. If enums could be extended, we could have created a base class common to both the entity table and the application table, and had the entity table and the application table extend it pretty trivially. But unfortunately that doesn't work with enums, and Java doesn't have mix-ins like Scala. As a way to minimize the duplication, we introduced {{ColumnHelper}} to pull many of the common operations into that helper class. You'll notice that most of the implementations in the \*Column\* classes are simple pass-throughs to {{ColumnHelper}}. This issue is more pronounced because the entity table and the application table are so similar. For example, for the app-to-flow table (which Zhijie is working on), this might not be as big an issue. We could think of some alternatives, but I think they also have their own challenges. First, we could have only one set of classes for both the entity table and the application table, and control which one to use via some sort of argument/flag. But then the problem is that we would have lots of {{if application ... else ...}} code scattered around that single implementation; I'm not sure that is an improvement. Eventually, if this becomes more of a need, we could envision writing some sort of code generation from a table/schema description, so that given the schema description these classes can simply be code-generated. However, as you may know, code generation is not without its own problems... I hope this clarifies some of the thinking that went into this. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch, YARN-3906-YARN-2928.002.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
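An illustrative-only sketch of the enum-plus-helper pattern described in the comment above (class and method names are simplified placeholders, not the YARN-2928 classes): the enum fixes the set of columns and stays strongly typed, while the shared behaviour lives in a plain helper that each enum constant delegates to; an EntityColumnSketch enum would look almost identical apart from its constants, which is exactly the duplication being discussed.
{code}
import java.nio.charset.StandardCharsets;

interface ColumnOps {
  byte[] getColumnQualifierBytes(String columnName);
}

final class ColumnHelperSketch implements ColumnOps {
  @Override
  public byte[] getColumnQualifierBytes(String columnName) {
    return columnName.getBytes(StandardCharsets.UTF_8);
  }
}

enum ApplicationColumnSketch {
  ID("id"),
  CREATED_TIME("created_time");

  private static final ColumnOps HELPER = new ColumnHelperSketch();
  private final String columnName;

  ApplicationColumnSketch(String columnName) {
    this.columnName = columnName;
  }

  // Pass-through to the shared helper, mirroring the *Column* -> ColumnHelper delegation.
  byte[] qualifierBytes() {
    return HELPER.getColumnQualifierBytes(columnName);
  }
}
{code}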
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648491#comment-14648491 ] Vrushali C commented on YARN-3984: -- To reach a conclusion on this: If everyone/most folks are +1 for putting the event timestamp before the event id itself {code} e! inverse_event_timestamp # eventid ? eventkey {code} I can go ahead and create the patch. Note that by doing so, we will *always* have to query for all event ids and all timestamps regardless of the query (unless we know the exact timestamp). If not, the other option is to put the event timestamp after the event id but before the event key. {code} e! eventid # inverse_event_timestamp ? eventkey {code} With this layout, we retain the option of querying for a particular event id. In both cases, we need to fetch all records, construct TimelineEvent objects and sort them into chronological order. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
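For reference, a minimal sketch of the two qualifier layouts being compared (plain string concatenation is used here instead of the real byte-level key encoding; the names are illustrative, not the actual timeline service code):
{code}
// Illustrative sketch only; not the actual timeline service key encoding.
public class EventColumnKeySketch {

  // Option 1: e!inverse_event_timestamp#eventid?eventkey
  static String timestampFirst(long eventTimestamp, String eventId, String eventKey) {
    long inverseTs = Long.MAX_VALUE - eventTimestamp; // newest events sort first
    return "e!" + inverseTs + "#" + eventId + "?" + eventKey;
  }

  // Option 2: e!eventid#inverse_event_timestamp?eventkey
  static String eventIdFirst(long eventTimestamp, String eventId, String eventKey) {
    long inverseTs = Long.MAX_VALUE - eventTimestamp;
    // keeps a common prefix per event id, so a single event id can still be scanned
    return "e!" + eventId + "#" + inverseTs + "?" + eventKey;
  }
}
{code}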
[jira] [Updated] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3999: -- Attachment: YARN-3999.patch Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648603#comment-14648603 ] Hadoop QA commented on YARN-3999: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 48s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 44s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 48s | The applied patch generated 1 new checkstyle issues (total was 50, now 50). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 3s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 3s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 43m 41s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 88m 24s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestApplicationCleanup | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | | | hadoop.yarn.server.resourcemanager.TestApplicationMasterService | | | hadoop.yarn.server.resourcemanager.TestRMAdminService | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens | | | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter | | | hadoop.yarn.server.resourcemanager.TestRMHA | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | | | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748093/YARN-3999.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8726/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8726/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8726/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8726/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency 
#63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8726/console | This message was automatically generated. Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648418#comment-14648418 ] Nathan Roberts commented on YARN-3945: -- Hi [~leftnoteasy]. Regarding minimum_user_limit_percent. - I totally agree it is very confusing. - I don't think we can change it in any significant way at this point without a major configuration switch that clearly indicates you're getting different behavior. I'm sure admins have built up clusters with this tuned in very specific ways; a significant change wouldn't be compatible with their expectations. bq. User-limit is not a fairness mechanism to balance resources between users, instead, it can lead to bad imbalance. One example is, if we set user-limit = 50, and there are 10 users running, we cannot manage how much resource can be used by each user. I don't really agree with this. It may not be doing an ideal job, but I think the intent is to introduce fairness between users. It's a progression from 0 being the most fair to 100+ being more FIFO. In your example it's trying to get everyone 50%, which isn't likely to happen, so in this case it's going to operate mostly FIFO. If the intent is to be much more fair across the 10 users, then a much smaller value would be appropriate. bq. meaningful since #active-user is changing every minute, it is not a predictable formula. Since the scheduler can't predict what an application is going to request in the future, I don't see how a predictable formula is even possible (ignoring the possibility of taking away resources via in-queue preemption). It's not great, but being fair to currently requesting users makes some sense. bq. Instead we may need to consider some notion like fair sharing: user-limit-factor becomes max-resource-limit of each user, and user-limit-percentage becomes something like guaranteed-concurrent-#user, when #user > guaranteed-concurrent-#user, rest users can only get idle shares. user-limit-factor is the max-resource-limit of each user today, right? The second one seems very hard to track. It seems like one of the initial users can stay in the guaranteed set as long as he keeps requesting resources. This doesn't seem very fair to the users only getting idle shares. maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but the description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queue's resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as an integer. {quote} Configuration related to the minimum limit should not be used in a formula to calculate the max applications for a user -- This message was sent by Atlassian JIRA (v6.3.4#6332)
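For reference, a small worked example of the formula quoted above, with hypothetical numbers, showing how the per-user application cap ends up tied to the minimum user-limit percentage:
{code}
// Hypothetical numbers, purely to illustrate the formula quoted above.
public class MaxAppsPerUserExample {
  public static void main(String[] args) {
    int maxApplications = 10000;
    float userLimit = 25f;        // minimum-user-limit-percent (the *minimum* share)
    float userLimitFactor = 1f;
    int maxApplicationsPerUser =
        (int) (maxApplications * (userLimit / 100.0f) * userLimitFactor);
    // Prints 2500, even though a single user may be allowed to use far more of
    // the queue when fewer users are active.
    System.out.println(maxApplicationsPerUser);
  }
}
{code}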
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648484#comment-14648484 ] Jian He commented on YARN-3999: --- Uploaded a patch which adds a timeout on draining the dispatcher. The value is set to be half of the am-rm-expiry-time. Beyond that, I also changed the order of a couple of services which might take a long time to flush the events on stop. Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
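A minimal sketch of the drain-with-timeout idea described above (illustrative only; this is not the actual AsyncDispatcher code, and the method and parameter names are assumptions):
{code}
// Illustrative sketch only; not the actual dispatcher implementation.
import java.util.concurrent.BlockingQueue;

public class DrainWithTimeoutSketch {
  static void awaitDrained(BlockingQueue<?> eventQueue, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!eventQueue.isEmpty() && System.currentTimeMillis() < deadline) {
      Thread.sleep(100); // the dispatcher thread keeps handling queued events
    }
    // Proceed with shutdown even if some events were not drained.
  }
}
{code}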
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648499#comment-14648499 ] Li Lu commented on YARN-3984: - Hi [~vrushalic], one quick question. I'm a little bit confused by this: bq. This would mean that we would never be able to query for a specific event. Maybe here you're assuming that the timestamp information is missing for some of our use cases? Or else, because timestamp is one of the two parts of the id of timeline event, I'm not sure why we cannot directly locate that specific column? Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3965) Add startup timestamp to nodemanager UI
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-3965: -- Attachment: YARN-3965-4.patch Add startup timestamp to nodemanager UI --- Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965-4.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs. It is then hard to check whether all NMs actually restarted; in practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM webservice, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
Carlo Curino created YARN-4003: -- Summary: ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent Key: YARN-4003 URL: https://issues.apache.org/jira/browse/YARN-4003 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Carlo Curino The inherited behavior from LeafQueue (limit AM % based on capacity) is not a good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
[ https://issues.apache.org/jira/browse/YARN-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648693#comment-14648693 ] Carlo Curino commented on YARN-4003: For LeafQueue it makes sense to compute the resources available for AMs based on capacity. For ReservationQueue, however, we rely on the fact that even if its capacity is zero, jobs can run (as resources are likely to grow substantially soon). The attached patch proposes to use the parent's guaranteed capacity as an upper bound on how many AMs we can run. This is clearly a loose constraint, but I don't know which other value would make sense. [~leftnoteasy], [~jianhe], [~subru], any thoughts? ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent Key: YARN-4003 URL: https://issues.apache.org/jira/browse/YARN-4003 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Carlo Curino Attachments: YARN-4003.patch The inherited behavior from LeafQueue (limit AM % based on capacity) is not a good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
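A rough sketch of the proposal above (the field and method names are illustrative assumptions, not the actual CapacityScheduler APIs): base the AM resource limit on the parent's guaranteed capacity rather than on this queue's own, possibly zero, capacity.
{code}
// Rough sketch of the idea only; not the actual ReservationQueue code.
public class ReservationQueueAmLimitSketch {
  float maxAMResourcePercent;       // e.g. 0.1f, the configured AM resource percentage
  long parentGuaranteedMemoryMB;    // guaranteed capacity of the parent plan queue

  long getAMResourceLimitMB() {
    // Use the parent's guaranteed capacity as the upper bound, since this
    // queue's own capacity is highly dynamic and may currently be zero.
    return (long) (parentGuaranteedMemoryMB * maxAMResourcePercent);
  }
}
{code}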
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648728#comment-14648728 ] Hadoop QA commented on YARN-3999: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 34s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 54s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 23s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 0s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 42m 51s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 83m 55s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification | | | hadoop.yarn.server.resourcemanager.security.TestAMRMTokens | | | hadoop.yarn.server.resourcemanager.TestRMHA | | | hadoop.yarn.server.resourcemanager.TestRMRestart | | | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens | | | hadoop.yarn.server.resourcemanager.TestApplicationCleanup | | | hadoop.yarn.server.resourcemanager.TestApplicationMasterService | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | | | hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter | | | hadoop.yarn.server.resourcemanager.TestAMAuthorization | | | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | | hadoop.yarn.server.resourcemanager.TestRMAdminService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748093/YARN-3999.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8730/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8730/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8730/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8730/console | This message was automatically generated. 
Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648733#comment-14648733 ] Hadoop QA commented on YARN-221: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 21m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:red}-1{color} | javac | 7m 38s | The applied patch generated 1 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 3m 15s | The applied patch generated 1 new checkstyle issues (total was 212, now 212). | | {color:red}-1{color} | whitespace | 1m 21s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 24s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 7m 40s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 22m 22s | Tests passed in hadoop-common. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 17s | Tests passed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 52m 23s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 138m 35s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748108/YARN-221-6.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | javac | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/diffJavacWarnings.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8728/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8728/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8728/console | This message was automatically generated. NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-6.patch, YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3965) Add startup timestamp to nodemanager UI
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648706#comment-14648706 ] Hadoop QA commented on YARN-3965: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 6s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 46s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 1 new checkstyle issues (total was 47, now 48). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 8s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 44m 1s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestDeletionService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748110/YARN-3965-4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 88d8736 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8729/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8729/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8729/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8729/console | This message was automatically generated. Add startup timestamp to nodemanager UI --- Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965-4.patch, YARN-3965.patch We have startup timestamp for RM already, but don't for NM. Sometimes cluster operator modified configuration of all nodes and kicked off command to restart all NMs. He found out it's hard for him to check whether all NMs are restarted. Actually there's always some NMs didn't restart as he expected, which leads to some error later due to inconsistent configuration. If we have startup timestamp for NM, the operator could easily fetch it via NM webservice and find out which NM didn't restart, and take mannaul action for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-221) NM should provide a way for AM to tell it not to aggregate logs.
[ https://issues.apache.org/jira/browse/YARN-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-221: - Attachment: YARN-221-6.patch [~xgong] and others, here is the draft patch based on the new design. Besides the above discussions, * If the application specifies an invalid log aggregation policy class, the current implementation will fall back to the default policy instead of failing the application. An alternative approach is to have the NM fail the application instead. * For each new application, a new policy object will be created and used only by that application. This should be OK from a memory footprint as well as a runtime perf point of view. An alternative approach is to have applications share the same policy object if they use the same policy class and the same policy parameters. NM should provide a way for AM to tell it not to aggregate logs. Key: YARN-221 URL: https://issues.apache.org/jira/browse/YARN-221 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Reporter: Robert Joseph Evans Assignee: Ming Ma Attachments: YARN-221-6.patch, YARN-221-trunk-v1.patch, YARN-221-trunk-v2.patch, YARN-221-trunk-v3.patch, YARN-221-trunk-v4.patch, YARN-221-trunk-v5.patch The NodeManager should provide a way for an AM to tell it that either the logs should not be aggregated, that they should be aggregated with a high priority, or that they should be aggregated but with a lower priority. The AM should be able to do this in the ContainerLaunch context to provide a default value, but should also be able to update the value when the container is released. This would allow for the NM to not aggregate logs in some cases, and avoid connection to the NN at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
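A minimal sketch of the fallback behavior described in the first bullet above (illustrative only; the interface and class names here are assumptions, not the actual NodeManager code): if the application-supplied policy class cannot be loaded or instantiated, fall back to a default policy rather than failing the application.
{code}
// Illustrative sketch only; not the actual log aggregation policy code.
interface LogAggregationPolicySketch {
  boolean shouldAggregate(String containerId);
}

class DefaultPolicySketch implements LogAggregationPolicySketch {
  public boolean shouldAggregate(String containerId) { return true; }
}

class PolicyLoaderSketch {
  static LogAggregationPolicySketch load(String policyClassName) {
    try {
      Class<? extends LogAggregationPolicySketch> clazz =
          Class.forName(policyClassName).asSubclass(LogAggregationPolicySketch.class);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (Exception e) {
      // Invalid or missing policy class: fall back to the default policy
      // instead of failing the application, as discussed above.
      return new DefaultPolicySketch();
    }
  }
}
{code}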
[jira] [Assigned] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart
[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-4000: -- Assignee: Brahma Reddy Battula RM crashes with NPE if leaf queue becomes parent queue during restart - Key: YARN-4000 URL: https://issues.apache.org/jira/browse/YARN-4000 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Brahma Reddy Battula This is a similar situation to YARN-2308. If an application is active in queue A and then the RM restarts with a changed capacity scheduler configuration where queue A becomes a parent queue to other subqueues then the RM will crash with a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4001) normalizeHostName takes too much of execution time
Hong Zhiguo created YARN-4001: - Summary: normalizeHostName takes too much of execution time Key: YARN-4001 URL: https://issues.apache.org/jira/browse/YARN-4001 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor For each NodeHeartbeatRequest, NetUtils.normalizeHostName is called under a lock. I did profiling on a very large cluster and found that NetUtils.normalizeHostName takes most of the execution time of ResourceTrackerService.nodeHeartbeat(...). We'd better have an option to use the raw IP (plus port) as the Node identity to scale for large clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4002) make ResourceTrackerService.nodeHeartbeat more concurrent
Hong Zhiguo created YARN-4002: - Summary: make ResourceTrackerService.nodeHeartbeat more concurrent Key: YARN-4002 URL: https://issues.apache.org/jira/browse/YARN-4002 Project: Hadoop YARN Issue Type: Improvement Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Critical We have multiple RPC threads to handle NodeHeartbeatRequest from NMs. By design the method ResourceTrackerService.nodeHeartbeat should be concurrent enough to scale for large clusters. But we have a BIG lock in NodesListManager.isValidNode which I think is unnecessary. First, the fields includes and excludes of HostsFileReader are only updated on refresh of nodes. All RPC threads handling node heartbeats are only readers, so a RWLock could be used to allow concurrent access by the RPC threads. Second, since the fields includes and excludes of HostsFileReader are always updated by reference assignment, which is atomic in Java, the reader-side lock could just be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
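A brief sketch of the two alternatives described above (illustrative only; not the actual NodesListManager/HostsFileReader code, and the field and method names are assumptions): heartbeat handlers are readers only, so either a read-write lock or plain volatile reference publication avoids the single big lock.
{code}
// Illustrative sketch only; names and structure are assumptions.
import java.util.Collections;
import java.util.Set;

public class HostListsSketch {
  // Alternative 2: publish new sets by reference assignment. Since reference
  // assignment is atomic, readers need no lock; volatile gives visibility.
  private volatile Set<String> includes = Collections.emptySet();
  private volatile Set<String> excludes = Collections.emptySet();

  boolean isValidNode(String host) {
    Set<String> inc = includes;   // one volatile read gives a stable snapshot
    Set<String> exc = excludes;
    return (inc.isEmpty() || inc.contains(host)) && !exc.contains(host);
  }

  // Called only when nodes are refreshed; writers are rare.
  void refresh(Set<String> newIncludes, Set<String> newExcludes) {
    includes = newIncludes;       // atomic reference assignments
    excludes = newExcludes;
  }

  // Alternative 1 would instead wrap isValidNode() in a ReadWriteLock read lock
  // and refresh() in the write lock, so heartbeat threads never block each other.
}
{code}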
[jira] [Commented] (YARN-3965) Add startup timestamp to nodemanager UI
[ https://issues.apache.org/jira/browse/YARN-3965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648666#comment-14648666 ] Hong Zhiguo commented on YARN-3965: --- Hi [~jlowe], version 4 of the patch is uploaded with 2 changes: 1) NodeInfo.getNmStartupTime -> NodeInfo.getNMStartupTime 2) removed the final qualifier on NodeManager.nmStartupTime to avoid the checkstyle error: {code} Name 'nmStartupTime' must match pattern '^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$' {code} It's private with a getter, so it's OK for it not to be final. Add startup timestamp to nodemanager UI --- Key: YARN-3965 URL: https://issues.apache.org/jira/browse/YARN-3965 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-3965-2.patch, YARN-3965-3.patch, YARN-3965-4.patch, YARN-3965.patch We have a startup timestamp for the RM already, but not for the NM. Sometimes a cluster operator modifies the configuration of all nodes and kicks off a command to restart all NMs. It is then hard to check whether all NMs actually restarted; in practice there are always some NMs that didn't restart as expected, which leads to errors later due to inconsistent configuration. If we had a startup timestamp for the NM, the operator could easily fetch it via the NM webservice, find out which NMs didn't restart, and take manual action for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4003) ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent
[ https://issues.apache.org/jira/browse/YARN-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-4003: --- Attachment: YARN-4003.patch Simple proposal of a patch, where ReservationQueue overrides the LeafQueue behavior. ReservationQueue inherit getAMResourceLimit() from LeafQueue, but behavior is not consistent Key: YARN-4003 URL: https://issues.apache.org/jira/browse/YARN-4003 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Carlo Curino Attachments: YARN-4003.patch The inherited behavior from LeafQueue (limit AM % based on capacity) is not a good fit for ReservationQueue (that have highly dynamic capacity). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3528: --- Attachment: YARN-3528-005.patch [~varun_saxena], thanks a lot for your review. Attached a patch to address your comments. (Earlier, all tests passed locally.) Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528-002.patch, YARN-3528-003.patch, YARN-3528-004.patch, YARN-3528-005.patch, YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. * All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3528) Tests with 12345 as hard-coded port break jenkins
[ https://issues.apache.org/jira/browse/YARN-3528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648731#comment-14648731 ] Hadoop QA commented on YARN-3528: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 8m 8s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 18s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 6s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 2s | Tests passed in hadoop-common. | | {color:red}-1{color} | yarn tests | 6m 1s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 51m 55s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestNodeStatusUpdater | | | hadoop.yarn.server.nodemanager.TestNodeManagerShutdown | | | hadoop.yarn.server.nodemanager.TestNodeManagerResync | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748115/YARN-3528-005.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / c5caa25 | | hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8731/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8731/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8731/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8731/console | This message was automatically generated. Tests with 12345 as hard-coded port break jenkins - Key: YARN-3528 URL: https://issues.apache.org/jira/browse/YARN-3528 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0 Environment: ASF Jenkins Reporter: Steve Loughran Assignee: Brahma Reddy Battula Priority: Blocker Labels: test Attachments: YARN-3528-002.patch, YARN-3528-003.patch, YARN-3528-004.patch, YARN-3528-005.patch, YARN-3528.patch A lot of the YARN tests have hard-coded the port 12345 for their services to come up on. This makes it impossible to have scheduled or precommit tests to run consistently on the ASF jenkins hosts. Instead the tests fail regularly and appear to get ignored completely. A quick grep of 12345 shows up many places in the test suite where this practise has developed. 
* All {{BaseContainerManagerTest}} subclasses * {{TestNodeManagerShutdown}} * {{TestContainerManager}} + others This needs to be addressed through portscanning and dynamic port allocation. Please can someone do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission
[ https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648686#comment-14648686 ] Bibin A Chundatt commented on YARN-3940: Hi [~leftnoteasy], thank you for the review comments. {quote} We should check usage as I mentioned at: https://issues.apache.org/jira/browse/YARN-3940?focusedCommentId=14633876page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14633876. {quote} Will check how to handle this too. {quote} we may need to consider how to deal with node label update, currently, if we change labels on a node, all containers running on the node will be killed. I suggest to clear think about both of the problem before moving forward. {quote} As I understand it, in the below cases containers shouldn't be killed: # Running containers of applications submitted for the default partition on a labeled partition, in case of exclusivity(false) # When the queue has access to the new label / node Any other case? Can we move the second part to a separate jira for discussion? Thoughts? Please do correct me if I am wrong. Application moveToQueue should check NodeLabel permission -- Key: YARN-3940 URL: https://issues.apache.org/jira/browse/YARN-3940 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch Configure capacity scheduler Configure node label and submit application {{queue=A Label=X}} Move application to queue {{B}}, which does not have access to x {code} 2015-07-20 19:46:19,626 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1437385548409_0005_01 released container container_e08_1437385548409_0005_01_02 on node: host: host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 used=memory:512, vCores:1 with event: KILL 2015-07-20 19:46:20,970 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1437385548409_0005_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, queue=b1 doesn't have permission to access all labels in resource request. labelExpression of resource request=x.
Queue labels=y at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) {code} Same exception will be thrown till *heartbeat timeout* Then application state will be updated to *FAILED* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648758#comment-14648758 ] Hudson commented on YARN-433: - FAILURE: Integrated in Hadoop-trunk-Commit #8249 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8249/]) YARN-433. When RM is catching up with node updates then it should not expire acquired containers. Contributed by Xuan Gong (zxu: rev ab80e277039a586f6d6259b2511ac413e29ea4f8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, YARN-433.4.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-433: --- Fix Version/s: 2.8.0 When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Fix For: 2.8.0 Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, YARN-433.4.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-433) When RM is catching up with node updates then it should not expire acquired containers
[ https://issues.apache.org/jira/browse/YARN-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648761#comment-14648761 ] zhihai xu commented on YARN-433: Yes, thanks [~xgong]! I committed this to trunk and branch-2. When RM is catching up with node updates then it should not expire acquired containers -- Key: YARN-433 URL: https://issues.apache.org/jira/browse/YARN-433 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-433.1.patch, YARN-433.2.patch, YARN-433.3.patch, YARN-433.4.patch RM expires containers that are not launched within some time of being allocated. The default is 10mins. When an RM is not keeping up with node updates then it may not be aware of new launched containers. If the expire thread fires for such containers then the RM can expire them even though they may have launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648785#comment-14648785 ] Jian He commented on YARN-3983: --- - I suggest not setting the state implicitly in the constructor. It’s quite confusing which constructor indicates which state. Setting it explicitly at the caller makes the code easier to read. {code} public ContainerAllocation(RMContainer containerToBeUnreserved) { this(containerToBeUnreserved, null, AllocationState.QUEUE_SKIPPED); } public ContainerAllocation(RMContainer containerToBeUnreserved, Resource resourceToBeAllocated) { this(containerToBeUnreserved, resourceToBeAllocated, AllocationState.SUCCEEDED); } {code} - a reserved container returns the SUCCEEDED state? {code} ContainerAllocation result = new ContainerAllocation(null, request.getCapability()); result.reserved = true; result.containerNodeType = type; {code} Earlier, the below code would not be invoked for a reserved container; now it gets invoked {code} if (allocationResult.state == AllocationState.SUCCEEDED) { // Don't reset scheduling opportunities for offswitch assignments // otherwise the app will be delayed for each non-local assignment. // This helps apps with many off-cluster requests schedule faster. if (allocationResult.containerNodeType != NodeType.OFF_SWITCH) { if (LOG.isDebugEnabled()) { LOG.debug("Resetting scheduling opportunities"); } application.resetSchedulingOpportunities(priority); } // Non-exclusive scheduling opportunity is different: we need reset // it every time to make sure non-labeled resource request will be // most likely allocated on non-labeled nodes first. application.resetMissedNonPartitionedRequestSchedulingOpportunity(priority); } {code} - AllocationState#SUCCEEDED -> ALLOCATED. - the reserved boolean flag can be changed to an AllocationState#RESERVED state. - CSAssignment#NULL_ASSIGNMENT is not used, remove it. - the comment does not match the method name: {code} * doAllocation needs to handle following stuffs: {code} - Below code was originally outside of the priorities loop: {code} if (SchedulerAppUtils.isBlacklisted(application, node, LOG)) { return ContainerAllocation.APP_SKIPPED; } {code} - below code can be changed to use a null check? {code} if (Resources.greaterThan(rc, clusterResource, assigned.getResourceToBeAllocated(), Resources.none())) { {code} - move the for loop into the {{applicationContainerAllocator.allocate}} method {code} for (Priority priority : getPriorities()) { {code} - returning ContainerAllocation.QUEUE_SKIPPED, though logically correct, is semantically incorrect; it should return Priority_Skipped {code} private ContainerAllocation assignNodeLocalContainers( Resource clusterResource, ResourceRequest nodeLocalResourceRequest, FiCaSchedulerNode node, Priority priority, RMContainer reservedContainer, SchedulingMode schedulingMode, ResourceLimits currentResoureLimits) { if (canAssign(priority, node, NodeType.NODE_LOCAL, reservedContainer)) { return assignContainer(clusterResource, node, priority, nodeLocalResourceRequest, NodeType.NODE_LOCAL, reservedContainer, schedulingMode, currentResoureLimits); } return ContainerAllocation.QUEUE_SKIPPED; } // check if the resource request can access the label if (!SchedulerUtils.checkResourceRequestMatchingNodePartition(request, node.getPartition(), schedulingMode)) { // this is a reserved container, but we cannot allocate it now according // to label not match. This can be caused by node label changed // We should un-reserve this container.
return new ContainerAllocation(rmContainer); } {code} - APP_SKIPPED? original code seems skipping the priority {code} // Does the application need this resource? if (allocatedContainer == null) { // Skip this app if we failed to allocate. ContainerAllocation ret = new ContainerAllocation(allocationResult.containerToBeUnreserved); ret.state = AllocationState.APP_SKIPPED; return ret; } {code} Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch, YARN-3983.2.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there's a lot of differences between increasing a container and allocating a
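A minimal sketch of the explicit-state construction suggested in the first review point above (types here are simplified placeholders, not the actual patch code):
{code}
// Illustrative sketch only; RMContainer/Resource are replaced by placeholders.
public class ContainerAllocationSketch {
  enum AllocationState { ALLOCATED, RESERVED, APP_SKIPPED, PRIORITY_SKIPPED, QUEUE_SKIPPED }

  final Object containerToBeUnreserved;   // placeholder for RMContainer
  final Object resourceToBeAllocated;     // placeholder for Resource
  final AllocationState state;

  // A single constructor; callers name the state explicitly, so the intent is
  // visible at every call site instead of being implied by an overload.
  ContainerAllocationSketch(Object containerToBeUnreserved,
      Object resourceToBeAllocated, AllocationState state) {
    this.containerToBeUnreserved = containerToBeUnreserved;
    this.resourceToBeAllocated = resourceToBeAllocated;
    this.state = state;
  }
}
{code}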
[jira] [Commented] (YARN-3999) Add a timeout when drain the dispatcher
[ https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648786#comment-14648786 ] Xuan Gong commented on YARN-3999: - Can we move {code} transitionToStandby(false); {code} before {code} super.serviceStop(); {code}? In that case, when we shut down the RM, we would transition the RM to standby first (stopping all the active services) and then stop all the always-on services. Add a timeout when drain the dispatcher --- Key: YARN-3999 URL: https://issues.apache.org/jira/browse/YARN-3999 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3999.patch, YARN-3999.patch If external systems like ATS, or ZK becomes very slow, draining all the events take a lot of time. If this time becomes larger than 10 mins, all applications will expire. We can add a timeout and stop the dispatcher even if not all events are drained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3906) split the application table from the entity table
[ https://issues.apache.org/jira/browse/YARN-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3906: -- Attachment: YARN-3906-YARN-2928.001.patch v.1 patch posted. The application table is nearly identical to the entity table, except that some redundant information is omitted (e.g. entity type and entity id). The unit tests probably could be refactored a little more, but I wanted to get it reviewed. split the application table from the entity table - Key: YARN-3906 URL: https://issues.apache.org/jira/browse/YARN-3906 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3906-YARN-2928.001.patch Per discussions on YARN-3815, we need to split the application entities from the main entity table into its own table (application). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648151#comment-14648151 ] Hadoop QA commented on YARN-3978: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 14s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 40s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 41s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 19s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 57s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 0m 24s | Tests passed in hadoop-yarn-server-common. | | {color:green}+1{color} | yarn tests | 52m 15s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 97m 1s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748033/YARN-3978.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 91b42e7 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8719/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8719/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8719/console | This message was automatically generated. 
Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch, YARN-3978.003.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
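For illustration, a minimal sketch of how such a switch could be consumed by the history writer; the property name and helper class below are assumptions for this example, not necessarily the key added by the patch.
{code}
// Sketch only: a hypothetical flag that disables storing non-AM container
// metadata in the generic application history store.
import org.apache.hadoop.conf.Configuration;

public class HistoryStoreSaveSwitchSketch {
  // Hypothetical property name used purely for illustration.
  static final String SAVE_NON_AM_CONTAINER_META_INFO =
      "yarn.timeline-service.generic-application-history.save-non-am-container-meta-info";

  public static boolean shouldSaveContainerInfo(Configuration conf,
      boolean isAmContainer) {
    // AM container metadata is always kept; other containers are kept only
    // when the flag is on (defaulting to true preserves current behaviour).
    return isAmContainer
        || conf.getBoolean(SAVE_NON_AM_CONTAINER_META_INFO, true);
  }
}
{code}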
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647275#comment-14647275 ] zhangyubiao commented on YARN-3979: --- I found that the CPU usage and load were high because we use crontab to copy the RM logs. Today we stopped the copy, and the CPU usage and load returned to normal. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647303#comment-14647303 ] zhangyubiao commented on YARN-3979: --- I sent you the RM logs just now. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647319#comment-14647319 ] Varun Saxena commented on YARN-3232: I mean the RM can change these states while returning the application report. Some application states are not necessarily exposed to users Key: YARN-3232 URL: https://issues.apache.org/jira/browse/YARN-3232 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Varun Saxena application NEW_SAVING and SUBMITTED states are not necessarily exposed to users, as they are mostly internal to the system, transient, and not user-facing. We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647315#comment-14647315 ] Varun Saxena commented on YARN-3232: [~jianhe], what about the CLI? I guess we should not show these states there either. The RM can internally change the NEW_SAVING and SUBMITTED states to NEW. Some application states are not necessarily exposed to users Key: YARN-3232 URL: https://issues.apache.org/jira/browse/YARN-3232 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Varun Saxena application NEW_SAVING and SUBMITTED states are not necessarily exposed to users, as they are mostly internal to the system, transient, and not user-facing. We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
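To make the suggestion concrete, a minimal sketch of the kind of mapping the RM could apply before the state reaches the CLI or web UI; the helper class name is hypothetical.
{code}
// Sketch only: masking internal, transient states when exposing them to clients.
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public final class AppStateMaskerSketch {
  private AppStateMaskerSketch() {}

  // Map internal-only states to NEW before they reach the CLI / web UI;
  // all other states pass through unchanged.
  public static YarnApplicationState mask(YarnApplicationState state) {
    switch (state) {
      case NEW_SAVING:
      case SUBMITTED:
        return YarnApplicationState.NEW;
      default:
        return state;
    }
  }
}
{code}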
[jira] [Updated] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3887: -- Attachment: 0003-YARN-3887.patch Thank you [~jianhe] and [~rohithsharma] for the comments. Uploading a new patch addressing the comments. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch, 0003-YARN-3887.patch After YARN-2003, this adds support to change the priority of an application after submission. This ticket will handle the server-side implementation for the same. A new RMAppEvent will be created to handle this, and it will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
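As a rough sketch of the "new RMAppEvent" mentioned in the description, something along these lines could carry the requested priority through the dispatcher; the class name is an assumption, and the event type constant is left to the caller rather than invented here.
{code}
// Sketch only: an RMAppEvent subtype carrying the requested priority, so any
// scheduler can react to one common event. Not the committed implementation.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType;

public class RMAppPriorityChangeEventSketch extends RMAppEvent {
  private final Priority newPriority;

  public RMAppPriorityChangeEventSketch(ApplicationId appId,
      RMAppEventType type, Priority newPriority) {
    // The real patch would likely add a dedicated RMAppEventType constant;
    // here the type is passed in so this sketch assumes nothing new.
    super(appId, type);
    this.newPriority = newPriority;
  }

  public Priority getNewPriority() {
    return newPriority;
  }
}
{code}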
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647920#comment-14647920 ] Wangda Tan commented on YARN-3971: -- Looks good, committing... will add some comment to the change before commit. Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery -- Key: YARN-3971 URL: https://issues.apache.org/jira/browse/YARN-3971 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch Steps to reproduce # Create label x,y # Delete label x,y # Create label x,y add capacity scheduler xml for labels x and y too # Restart RM Both RM will become Standby. Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} {code} 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332)
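A minimal sketch of the general pattern (a recovery flag guarding the queue check); field and method names are illustrative and not taken from the attached patch.
{code}
// Sketch only: skip the queue validation while the label store is replaying
// its edit log, since capacity-scheduler.xml may legitimately reference
// labels that were deleted and re-added in the store history.
import java.io.IOException;
import java.util.Collection;

public abstract class NodeLabelsRecoverySketch {
  private volatile boolean inRecovery = false;

  protected abstract void checkRemoveFromClusterNodeLabelsOfQueue(
      Collection<String> labels) throws IOException;

  protected abstract void internalRemove(Collection<String> labels)
      throws IOException;

  public void setRecoveryMode(boolean flag) {
    inRecovery = flag;
  }

  public void removeFromClusterNodeLabels(Collection<String> labels)
      throws IOException {
    // Only enforce the queue check for live admin operations, not recovery.
    if (!inRecovery) {
      checkRemoveFromClusterNodeLabelsOfQueue(labels);
    }
    internalRemove(labels);
  }
}
{code}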
[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647873#comment-14647873 ] Jason Lowe commented on YARN-3942: -- I think so. We can probably create a new TimelineClient that stores to HDFS files based on how yarn.timeline-service.entity-file-store.summary-entity-types is configured. However, I'm not sure whether YARN can automatically replace the timeline clients being requested with this one, as the client needs to know the application ID when putting domains and the application attempt ID when posting entities. So one approach is to have YARN provide something like a TimelineEntityFileClient, which is a TimelineClient, but Tez and other app frameworks would have to explicitly ask for it themselves and provide the appropriate application ID/app attempt ID upon construction of the client. Let me know if that sounds OK, or if there's an idea of how YARN can seamlessly provide this alternative client instead of TimelineClientImpl when TimelineClient.createTimelineClient is called. Timeline store to read events from HDFS --- Key: YARN-3942 URL: https://issues.apache.org/jira/browse/YARN-3942 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3942.001.patch This adds a new timeline store plugin that is intended as a stop-gap measure to mitigate some of the issues we've seen with ATS v1 while waiting for ATS v2. The intent of this plugin is to provide a workable solution for running the Tez UI against the timeline server on large-scale clusters running many thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
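Under the assumptions above, a minimal sketch of what explicit construction with the attempt ID might look like for a framework such as Tez; the class name and directory layout are illustrative, not the committed API.
{code}
// Sketch only: an HDFS-backed timeline client that is constructed with the
// attempt ID, since a generic factory like TimelineClient.createTimelineClient()
// has no way to discover it.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

public class TimelineEntityFileClientSketch {
  private final FileSystem fs;
  private final Path attemptDir;

  public TimelineEntityFileClientSketch(Configuration conf,
      ApplicationAttemptId attemptId, Path baseDir) throws IOException {
    this.fs = FileSystem.get(conf);
    // One directory per application, one sub-directory per attempt.
    this.attemptDir = new Path(
        new Path(baseDir, attemptId.getApplicationId().toString()),
        attemptId.toString());
    fs.mkdirs(attemptDir);
  }

  public Path getAttemptDir() {
    return attemptDir;
  }
}
{code}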
[jira] [Updated] (YARN-3997) An Application requesting multiple core containers can't preempt running application made of single core containers
[ https://issues.apache.org/jira/browse/YARN-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3997: --- Assignee: Arun Suresh Target Version/s: 2.8.0 Priority: Critical (was: Major) An Application requesting multiple core containers can't preempt running application made of single core containers --- Key: YARN-3997 URL: https://issues.apache.org/jira/browse/YARN-3997 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.1 Environment: Ubuntu 14.04, Hadoop 2.7.1, Physical Machines Reporter: Dan Shechter Assignee: Arun Suresh Priority: Critical When our cluster is configured with preemption, and is fully loaded with an application consuming 1-core containers, it will not kill off these containers when a new application kicks in requesting, for example, 4-core containers. When the second application attempts to use 1-core containers as well, preemption proceeds as planned and everything works properly. It is my assumption that the fair scheduler, while recognizing it needs to kill off some containers to make room for the new application, fails to find a SINGLE container satisfying the request for a 4-core container (since all existing containers are 1-core containers), and isn't smart enough to realize it needs to kill off 4 single-core containers (in this case) on a single node for the new application to be able to proceed... The exhibited effect is that the new application is hung indefinitely and never gets the resources it requires. This can easily be replicated with any YARN application. Our go-to scenario in this case is running pyspark with 1-core executors (containers) while trying to launch the h2o.ai framework, which INSISTS on having at least 4 cores per container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
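To make the reported gap concrete, a minimal sketch of the per-node aggregation the scheduler would need when choosing preemption victims: instead of looking for one running container at least as large as the pending ask, collect enough smaller containers on one node so that their combined resources cover it. Class and method names are illustrative, not FairScheduler code.
{code}
// Sketch only: pick enough small containers on a single node to cover a
// larger pending request (e.g. four 1-core containers for one 4-core ask).
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;

public class NodeLocalPreemptionSketch {
  public static List<RMContainer> pickVictims(List<RMContainer> runningOnNode,
      Resource pendingAsk) {
    List<RMContainer> victims = new ArrayList<RMContainer>();
    long mem = 0;
    int vcores = 0;
    for (RMContainer c : runningOnNode) {
      victims.add(c);
      mem += c.getContainer().getResource().getMemory();
      vcores += c.getContainer().getResource().getVirtualCores();
      // Stop as soon as the freed capacity on this node covers the request.
      if (mem >= pendingAsk.getMemory()
          && vcores >= pendingAsk.getVirtualCores()) {
        return victims;
      }
    }
    // This node alone cannot satisfy the ask; preempt nothing here.
    return new ArrayList<RMContainer>();
  }
}
{code}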
[jira] [Commented] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647916#comment-14647916 ] Hadoop QA commented on YARN-1643: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 14s | Findbugs (version ) appears to be broken on YARN-1197. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 41s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 20s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 6m 14s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 42m 42s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748018/YARN-1643-YARN-1197.7.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-1197 / cb95662 | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8716/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8716/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8716/console | This message was automatically generated. Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643-YARN-1197.7.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647926#comment-14647926 ] MENG DING commented on YARN-1643: - The failed test does not seem to be related. Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643-YARN-1197.7.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3971: - Attachment: 0005-YARN-3971.patch Attached latest patch committed to trunk. Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery -- Key: YARN-3971 URL: https://issues.apache.org/jira/browse/YARN-3971 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Fix For: 2.8.0 Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.patch Steps to reproduce # Create label x,y # Delete label x,y # Create label x,y add capacity scheduler xml for labels x and y too # Restart RM Both RM will become Standby. Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} {code} 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332)
[jira] [Commented] (YARN-3232) Some application states are not necessarily exposed to users
[ https://issues.apache.org/jira/browse/YARN-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647957#comment-14647957 ] Jian He commented on YARN-3232: --- agree Some application states are not necessarily exposed to users Key: YARN-3232 URL: https://issues.apache.org/jira/browse/YARN-3232 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Varun Saxena application NEW_SAVING and SUBMITTED states are not necessarily exposed to users, as they are mostly internal to the system, transient, and not user-facing. We may deprecate these two states and remove them from the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3940) Application moveToQueue should check NodeLabel permission
[ https://issues.apache.org/jira/browse/YARN-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647955#comment-14647955 ] Wangda Tan commented on YARN-3940: -- [~bibinchundatt], I took a look at the patch; it checks the app's node-label-expression AND the queue's accessible-node-labels, which is not sufficient in my opinion. We should check usage, as I mentioned at: https://issues.apache.org/jira/browse/YARN-3940?focusedCommentId=14633876page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14633876. Actually, I don't have a clear idea about how to solve this problem either. Another related problem is how to deal with node label updates: currently, if we change the labels on a node, all containers running on the node will be killed. I suggest we think carefully about both of these problems before moving forward. Application moveToQueue should check NodeLabel permission -- Key: YARN-3940 URL: https://issues.apache.org/jira/browse/YARN-3940 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Attachments: 0001-YARN-3940.patch, 0002-YARN-3940.patch Configure capacity scheduler Configure node label and submit application {{queue=A Label=X}} Move application to queue {{B}} where label x is not accessible {code} 2015-07-20 19:46:19,626 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1437385548409_0005_01 released container container_e08_1437385548409_0005_01_02 on node: host: host-10-19-92-117:64318 #containers=1 available=memory:2560, vCores:15 used=memory:512, vCores:1 with event: KILL 2015-07-20 19:46:20,970 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1437385548409_0005_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, queue=b1 doesn't have permission to access all labels in resource request. labelExpression of resource request=x. 
Queue labels=y at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:304) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:250) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:106) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:515) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2174) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2170) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2168) {code} Same exception will be thrown till *heartbeat timeout* Then application state will be updated to *FAILED* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3942) Timeline store to read events from HDFS
[ https://issues.apache.org/jira/browse/YARN-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648057#comment-14648057 ] Jason Lowe commented on YARN-3942: -- The logs are created per app attempt. This helps avoid the split-brain, double-writer issue where the previous attempt is still running when the RM expires it (e.g. due to a network cut) and decides to launch another. The files are stored and looked up in a directory named after the application ID, and the entity files within that directory are stored based on the application attempt ID. I don't think using the app attempt ID is crucial for the latter, and the reader does not rely on the attempt ID from those files, but it was a simple way to avoid colliding with previous attempts and to have the reader process the files in attempt order. Timeline store to read events from HDFS --- Key: YARN-3942 URL: https://issues.apache.org/jira/browse/YARN-3942 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-3942.001.patch This adds a new timeline store plugin that is intended as a stop-gap measure to mitigate some of the issues we've seen with ATS v1 while waiting for ATS v2. The intent of this plugin is to provide a workable solution for running the Tez UI against the timeline server on large-scale clusters running many thousands of jobs per day. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
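A minimal sketch of the reader side under that layout (one directory per application ID, entity files keyed by attempt ID); the helper is illustrative and assumes that lexicographic ordering of the file names matches attempt order, which holds for standard zero-padded attempt IDs of the same application.
{code}
// Sketch only: list per-attempt entity files under an application directory
// and visit them in attempt order.
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AttemptOrderedReaderSketch {
  public static FileStatus[] listInAttemptOrder(FileSystem fs, Path appDir)
      throws IOException {
    FileStatus[] files = fs.listStatus(appDir);
    // Attempt IDs carry a zero-padded, monotonically increasing attempt
    // number, so sorting names lexicographically yields attempt order.
    Arrays.sort(files, new Comparator<FileStatus>() {
      @Override
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });
    return files;
  }
}
{code}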
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648089#comment-14648089 ] Hadoop QA commented on YARN-3887: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 58s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 10m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 25s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 31s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 36s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 7s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 43s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 38s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 53m 4s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 100m 19s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748024/0003-YARN-3887.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8acb30b | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8717/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8717/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8717/console | This message was automatically generated. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch, 0003-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1643: Attachment: YARN-1643-YARN-1197.7.patch Attaching the latest patch, which addresses the following:
* Make {{trackingContainers}} a ConcurrentHashMap, and remove {{containersToBeRemoved}}, {{containersToBeAdded}}, and the corresponding logic. Containers are directly added to/removed from/updated in {{trackingContainers}} when the corresponding events are received.
* Synchronize getters and setters in {{ProcessTreeInfo}} with regard to the vmemLimit/pmemLimit/cpuVcores fields.
* The previous patch didn't handle container metrics updates for container resize; add that and extract the container metrics logic into a common function.
* Add relevant test cases.
Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643-YARN-1197.7.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
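A condensed sketch of the concurrency pattern described above: the monitor thread iterates the map directly while event handlers add, remove, or resize entries, so the map is a ConcurrentHashMap and the mutable limits sit behind synchronized accessors. The class is illustrative and far smaller than the real ContainersMonitorImpl.
{code}
// Sketch only: concurrent tracking map plus synchronized per-container limits.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class ContainersMonitorSketch {
  static class ProcessTreeInfo {
    private long vmemLimit;
    private long pmemLimit;
    private int cpuVcores;

    synchronized void setResourceLimit(long vmem, long pmem, int vcores) {
      this.vmemLimit = vmem;
      this.pmemLimit = pmem;
      this.cpuVcores = vcores;
    }

    synchronized long getVmemLimit() { return vmemLimit; }
    synchronized long getPmemLimit() { return pmemLimit; }
    synchronized int getCpuVcores() { return cpuVcores; }
  }

  // Updated directly from start/stop/resize events; no staging collections
  // such as containersToBeAdded/containersToBeRemoved are needed.
  private final Map<ContainerId, ProcessTreeInfo> trackingContainers =
      new ConcurrentHashMap<ContainerId, ProcessTreeInfo>();

  public void onResourceChanged(ContainerId id, long vmem, long pmem, int vcores) {
    ProcessTreeInfo info = trackingContainers.get(id);
    if (info != null) {
      info.setResourceLimit(vmem, pmem, vcores);
    }
  }
}
{code}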
[jira] [Updated] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3990: --- Attachment: 0003-YARN-3990.patch [~jlowe] Thank you for your comments. {quote} java and release audit warnings. {quote} Done {quote} The new TestNMUpdateEvent file probably should just be named TestNodesListManager {quote} Done AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch, 0002-YARN-3990.patch, 0003-YARN-3990.patch Whenever a node is added or removed, NodesListManager sends an RMAppNodeUpdateEvent to all the applications in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events:
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
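A minimal sketch of such a check; the helper name and the exact set of final states are assumptions for illustration, and the real patch may gate the loop differently.
{code}
// Sketch only: filter out completed applications before sending
// RMAppNodeUpdateEvent, so node flapping does not flood the AsyncDispatcher.
import java.util.EnumSet;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;

public final class NodeUpdateFilterSketch {
  private static final EnumSet<RMAppState> COMPLETED_STATES =
      EnumSet.of(RMAppState.FINISHED, RMAppState.FAILED, RMAppState.KILLED);

  private NodeUpdateFilterSketch() {}

  // Only applications that have not reached a final state need to hear
  // about node usability changes.
  public static boolean shouldNotify(RMApp app) {
    return !COMPLETED_STATES.contains(app.getState());
  }
}
{code}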
[jira] [Commented] (YARN-3971) Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery
[ https://issues.apache.org/jira/browse/YARN-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647949#comment-14647949 ] Bibin A Chundatt commented on YARN-3971: Thanks [~leftnoteasy] for review and committing patch Skip RMNodeLabelsManager#checkRemoveFromClusterNodeLabelsOfQueue on nodelabel recovery -- Key: YARN-3971 URL: https://issues.apache.org/jira/browse/YARN-3971 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Fix For: 2.8.0 Attachments: 0001-YARN-3971.patch, 0002-YARN-3971.patch, 0003-YARN-3971.patch, 0004-YARN-3971.patch, 0005-YARN-3971.patch Steps to reproduce # Create label x,y # Delete label x,y # Create label x,y add capacity scheduler xml for labels x and y too # Restart RM Both RM will become Standby. Since below exception is thrown on {{FileSystemNodeLabelsStore#recover}} {code} 2015-07-23 14:03:33,627 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager failed in state STARTED; cause: java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label java.io.IOException: Cannot remove label=x, because queue=a1 is using this label. Please remove label on queue before remove the label at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.checkRemoveFromClusterNodeLabelsOfQueue(RMNodeLabelsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager.removeFromClusterNodeLabels(RMNodeLabelsManager.java:118) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:221) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:232) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:245) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:587) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:964) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1005) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1001) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1001) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:312) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:832) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:422) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} -- This 
message was sent by Atlassian JIRA (v6.3.4#6332)