[jira] [Updated] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2768: --- Summary: Avoid cloning Resource in FSAppAttempt#updateDemand (was: Avoid cloning Resource in FSAppAttempt.updateDemand) Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
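For context, a minimal sketch of the in-place accumulation the description proposes. The helper name {{multiplyAndAddTo}} is an assumption for illustration and may not match what the committed patch actually adds:

{code}
// Sketch only: fold the multiply into the accumulation so no intermediate
// Resource is cloned. Helper name and int rounding are illustrative.
public static Resource multiplyAndAddTo(Resource lhs, Resource rhs, double by) {
  lhs.setMemory(lhs.getMemory() + (int) (rhs.getMemory() * by));
  lhs.setVirtualCores(lhs.getVirtualCores() + (int) (rhs.getVirtualCores() * by));
  return lhs;
}

// FSAppAttempt#updateDemand would then accumulate directly into demand:
//   Resources.multiplyAndAddTo(demand, r.getCapability(), r.getNumContainers());
{code}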
[jira] [Updated] (YARN-2768) Avoid cloning Resource in FSAppAttempt.updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2768: --- Summary: Avoid cloning Resource in FSAppAttempt.updateDemand (was: optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread) Avoid cloning Resource in FSAppAttempt.updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3919: Attachment: 0003-YARN-3919.patch The current patch does not apply on my machine, so I am regenerating the same patch from my machine and uploading it so HadoopQA can kick off a run before commit. NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
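A minimal sketch of the null-guarded shutdown the description asks for, assuming a store-like field that may still be uninitialized when serviceStart fails partway through; the field name is illustrative:

{code}
// Sketch only: guard resources that may never have been initialized when
// serviceStart() failed before completing. Field name is illustrative.
@Override
protected void serviceStop() throws Exception {
  if (store != null) {
    store.close();
  }
  super.serviceStop();
}
{code}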
[jira] [Commented] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646344#comment-14646344 ] Robert Kanter commented on YARN-3950: - Thanks! Add unique SHELL_ID environment variable to DistributedShell Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
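For illustration, a sketch of how the AM could hand each shell a unique id through the container environment; the counter field and its placement in the launch path are assumptions, not the committed patch, and the variable name follows the later rename to YARN_SHELL_ID (see the update further below):

{code}
// Sketch only: assign a monotonically increasing id per launched shell and
// expose it via the container environment. Names are illustrative.
private final AtomicInteger shellIdCounter = new AtomicInteger(0);

// In the AM's container-launch path:
Map<String, String> env = new HashMap<>(shellEnv);
env.put("YARN_SHELL_ID", String.valueOf(shellIdCounter.incrementAndGet()));
// env is then passed into the ContainerLaunchContext for this container.
{code}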
[jira] [Commented] (YARN-3989) Show messages only for NodeLabel commands in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646361#comment-14646361 ] Bibin A Chundatt commented on YARN-3989: [~sunilg] and [~Naganarasimha] Currently I would like to keep only the NodeLabel commands in scope for this JIRA; the other commands make it more complicated. A few such commands, as [~sunilg] pointed out: # addToClusterNodeLabels # removeFromClusterNodeLabels # replaceLabelsOnNode A huge stack trace is shown on the console, and from the trace it becomes very difficult to get the actual error message. In most cases only direct messages are intended for users, not the full stack trace. So as of now I will limit the scope of this JIRA to the NodeLabel commands. Show messages only for NodeLabel commands in RMAdminCLI Key: YARN-3989 URL: https://issues.apache.org/jira/browse/YARN-3989 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Currently, on node-label command execution failure, the full exception stack trace is shown. This JIRA is to handle the exceptions and show only the messages. As per the discussion in YARN-3963 [~sunilg] {quote} As I see it, I can see the full exception stack trace on the client side in this case (also in case of other commands too) and it's too verbose. I think we can make it compact and then it will be more easily readable. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
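A hedged sketch of the kind of change being scoped here: catch the exception in the node-label command handlers and print only its message. Method and variable names are illustrative:

{code}
// Sketch only: report the error message for node-label commands instead of
// dumping the full stack trace to the console. Names are illustrative.
try {
  adminProtocol.addToClusterNodeLabels(request);
} catch (IOException | YarnException e) {
  System.err.println(e.getMessage());  // message only, no stack trace
  return -1;
}
{code}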
[jira] [Commented] (YARN-2768) optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646374#comment-14646374 ] Karthik Kambatla commented on YARN-2768: +1 optimize FSAppAttempt.updateDemand by avoid clone of Resource which takes 85% of computing time of update thread Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2768. Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.8.0 Just committed to trunk and branch-2. Thanks [~zhiguohong] for the contribution. Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646465#comment-14646465 ] Li Lu commented on YARN-3049: - Thanks [~zjshen]! For now I think it's fine to include the changes on app2flow table. I'll take a look at your latest patch. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646403#comment-14646403 ] MENG DING commented on YARN-1643: - Thanks [~jianhe] for the review: bq. Here, memory * 2^20, but it gets reverted later on at info.pmemLimit 20, we can just use the original value ? Will do. bq. Do you think we can change the trackingContainers to be concurrentHashMap and update the ptInfo directly ? Also the getter and setter of ptInfo can synchronize on the ptInfo object Yes, we can make {{trackingContainers}} a {{ConcurrentHashMap}}, and add setters to {{ProcessTreeInfo}} for vmemLimit, pmemLimit, and cpuVcores, and then have the getters and setters synchronized on the object. IIUC, the main benefit is that we don't need to synchronize on the {{enforceResourceLimits}} call, which can be heavy, right? If that is the case, we probably also need to have proper synchronization for {{ResourceCalculatorProcessTree}}, e.g., {{ProcfsBasedProcessTree}}/{{WindowsBasedProcessTree}}? These objects could be updated by multiple threads as well. I was afraid that the code change may be too much? For other objects like {{containersToBeChanged}}, {{containersToBeAdded}}, {{containersToBeRemoved}}, I think we still need to synchronize on the entire map like the way it is right now, because we are calling functions like {{containersToBeRemoved.clear()}}. Thoughts? Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
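A rough sketch of the locking scheme discussed in the comment above: a {{ConcurrentHashMap}} for the tracked containers plus per-object synchronization on {{ProcessTreeInfo}}. All member names are illustrative assumptions:

{code}
// Sketch only: ConcurrentHashMap for tracked containers, per-object
// synchronization for the resource limits. Names are illustrative.
private final ConcurrentMap<ContainerId, ProcessTreeInfo> trackingContainers =
    new ConcurrentHashMap<>();

static class ProcessTreeInfo {
  private long vmemLimit;
  private long pmemLimit;
  private int cpuVcores;

  synchronized long getVmemLimit() { return vmemLimit; }
  synchronized long getPmemLimit() { return pmemLimit; }
  synchronized void setResourceLimit(long vmem, long pmem, int vcores) {
    this.vmemLimit = vmem;
    this.pmemLimit = pmem;
    this.cpuVcores = vcores;
  }
}
{code}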
[jira] [Assigned] (YARN-3991) Investigate if we need an atomic way to set both memory and CPU on Resource
[ https://issues.apache.org/jira/browse/YARN-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3991: -- Assignee: Varun Saxena Investigate if we need an atomic way to set both memory and CPU on Resource --- Key: YARN-3991 URL: https://issues.apache.org/jira/browse/YARN-3991 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.1 Reporter: Karthik Kambatla Assignee: Varun Saxena Labels: capacityscheduler, fairscheduler, scheduler While reviewing another patch, noticed that we have independent methods to set memory and CPU. Do we need another method to set them both atomically? Otherwise, would two threads trying to set both values lose any updates? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
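For illustration, one possible shape of an atomic setter; whether something like this is needed, or whether callers should simply avoid sharing mutable {{Resource}} instances across threads, is exactly what this JIRA asks to investigate. The method is hypothetical:

{code}
// Sketch only: a combined setter so the two fields cannot be interleaved by
// concurrent writers. Hypothetical API, not an existing Resource method.
public synchronized void setResources(int memoryMB, int vCores) {
  setMemory(memoryMB);
  setVirtualCores(vCores);
}
{code}

Note this only helps if reads and the individual setters synchronize on the same lock.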
[jira] [Commented] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646454#comment-14646454 ] Sunil G commented on YARN-3992: --- Thank you [~zjshen] for reporting the same. I will take a look into this. TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646339#comment-14646339 ] Zhijie Shen commented on YARN-3049: --- TestApplicationPriority.testApplicationPriorityAllocation seems to have a race condition issue. I cannot reproduce it locally, either on trunk or on YARN-2928 with this patch. Anyway, it does not seem to be related to this JIRA. Will file a separate JIRA to track the test failure. [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646443#comment-14646443 ] Zhijie Shen commented on YARN-3992: --- The problem was found with jenkins build on YARN-3049: https://builds.apache.org/job/PreCommit-YARN-Build/8701/testReport/ TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646452#comment-14646452 ] Sunil G commented on YARN-3992: --- Thank you [~zhi TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2768) Avoid cloning Resource in FSAppAttempt#updateDemand
[ https://issues.apache.org/jira/browse/YARN-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646401#comment-14646401 ] Hudson commented on YARN-2768: -- FAILURE: Integrated in Hadoop-trunk-Commit #8241 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8241/]) YARN-2768. Avoid cloning Resource in FSAppAttempt#updateDemand. (Hong Zhiguo via kasha) (kasha: rev 5205a330b387d2e133ee790b9fe7d5af3cd8bccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * hadoop-yarn-project/CHANGES.txt Avoid cloning Resource in FSAppAttempt#updateDemand --- Key: YARN-2768 URL: https://issues.apache.org/jira/browse/YARN-2768 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Fix For: 2.8.0 Attachments: YARN-2768.patch, profiling_FairScheduler_update.png See the attached picture of profiling result. The clone of Resource object within Resources.multiply() takes up **85%** (19.2 / 22.6) CPU time of the function FairScheduler.update(). The code of FSAppAttempt.updateDemand: {code} public void updateDemand() { demand = Resources.createResource(0); // Demand is current consumption plus outstanding requests Resources.addTo(demand, app.getCurrentConsumption()); // Add up outstanding resource requests synchronized (app) { for (Priority p : app.getPriorities()) { for (ResourceRequest r : app.getResourceRequests(p).values()) { Resource total = Resources.multiply(r.getCapability(), r.getNumContainers()); Resources.addTo(demand, total); } } } } {code} The code of Resources.multiply: {code} public static Resource multiply(Resource lhs, double by) { return multiplyTo(clone(lhs), by); } {code} The clone could be skipped by directly update the value of this.demand. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3950) Add unique SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646268#comment-14646268 ] Hudson commented on YARN-3950: -- FAILURE: Integrated in Hadoop-trunk-Commit #8239 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8239/]) YARN-3950. Add unique SHELL_ID environment variable to DistributedShell. Contributed by Robert Kanter (jlowe: rev 2b2bd9214604bc2e14e41e08d30bf86f512151bd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDSAppMaster.java Add unique SHELL_ID environment variable to DistributedShell Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3950) Add unique YARN_SHELL_ID environment variable to DistributedShell
[ https://issues.apache.org/jira/browse/YARN-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-3950: --- Summary: Add unique YARN_SHELL_ID environment variable to DistributedShell (was: Add unique SHELL_ID environment variable to DistributedShell) Add unique YARN_SHELL_ID environment variable to DistributedShell - Key: YARN-3950 URL: https://issues.apache.org/jira/browse/YARN-3950 Project: Hadoop YARN Issue Type: Improvement Components: applications/distributed-shell Affects Versions: 2.8.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.8.0 Attachments: YARN-3950.001.patch, YARN-3950.002.patch As discussed in [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6415?focusedCommentId=14636027page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14636027], it would be useful to have a monotonically increasing and independent ID of some kind that is unique per shell in the distributed shell program. We can do that by adding a SHELL_ID env var. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3991) Investigate if we need an atomic way to set both memory and CPU on Resource
Karthik Kambatla created YARN-3991: -- Summary: Investigate if we need an atomic way to set both memory and CPU on Resource Key: YARN-3991 URL: https://issues.apache.org/jira/browse/YARN-3991 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.1 Reporter: Karthik Kambatla While reviewing another patch, noticed that we have independent methods to set memory and CPU. Do we need another method to set them both atomically? Otherwise, would two threads trying to set both values lose any updates? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3991) Investigate if we need an atomic way to set both memory and CPU on Resource
[ https://issues.apache.org/jira/browse/YARN-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3991: --- Labels: capacityscheduler fairscheduler scheduler (was: ) Investigate if we need an atomic way to set both memory and CPU on Resource --- Key: YARN-3991 URL: https://issues.apache.org/jira/browse/YARN-3991 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.1 Reporter: Karthik Kambatla Labels: capacityscheduler, fairscheduler, scheduler While reviewing another patch, noticed that we have independent methods to set memory and CPU. Do we need another method to set them both atomically? Otherwise, would two threads trying to set both values lose any updates? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646415#comment-14646415 ] Varun Saxena commented on YARN-3919: [~rohithsharma], are you using {{patch -p0}} command ? NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646448#comment-14646448 ] Rohith Sharma K S commented on YARN-3919: - No... git apply --whitespace=fix patch-file NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646449#comment-14646449 ] zhihai xu commented on YARN-3990: - Yes, that is a good catch! {{rmContext.getRMApps()}} stores both completed and running apps. The current default value for max-completed-applications is 10000, so we may save up to 10000 NodeUpdateEvents. +1 too. AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch Whenever a node is added or removed, NodesListManager sends an RMAppNodeUpdateEvent to all the applications that are in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + " reported unusable"); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + " reported usable"); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error("Ignoring invalid eventtype " + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
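A minimal sketch of the extra guard the description proposes, skipping applications already in a terminal state before queueing the event; whether the actual patch uses exactly this check is not shown here:

{code}
// Sketch only: skip apps that are already finished/failed/killed before
// queuing an RMAppNodeUpdateEvent for them.
for (RMApp app : rmContext.getRMApps().values()) {
  RMAppState state = app.getState();
  if (state == RMAppState.FINISHED || state == RMAppState.FAILED
      || state == RMAppState.KILLED) {
    continue;  // completed applications do not need node updates
  }
  this.rmContext.getDispatcher().getEventHandler().handle(
      new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
          RMAppNodeUpdateType.NODE_USABLE));
}
{code}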
[jira] [Created] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
Zhijie Shen created YARN-3992: - Summary: TestApplicationPriority.testApplicationPriorityAllocation fails intermittently Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-3992: - Assignee: Sunil G TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646501#comment-14646501 ] Hadoop QA commented on YARN-3919: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 6s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 10m 22s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 12m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 28s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 5s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 44s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 42s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 56s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 2m 12s | Tests passed in hadoop-yarn-common. | | | | 51m 18s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747812/0003-YARN-3919.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 5205a33 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8706/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8706/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8706/console | This message was automatically generated. NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. 
{noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646566#comment-14646566 ] Anubhav Dhoot commented on YARN-3990: - Change looks good. It would be good to have a unit test to catch regressions. AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch Whenever a node is added or removed, NodesListManager sends an RMAppNodeUpdateEvent to all the applications that are in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + " reported unusable"); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + " reported usable"); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error("Ignoring invalid eventtype " + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646835#comment-14646835 ] Hadoop QA commented on YARN-3983: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 54s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 9m 41s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 12m 3s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 24s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 8s | The applied patch generated 41 new checkstyle issues (total was 54, now 66). | | {color:red}-1{color} | whitespace | 0m 2s | The patch has 16 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 2m 0s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 39s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 51s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 54m 36s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 101m 24s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions | | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747851/YARN-3983.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c020b62 | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/8709/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8709/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8709/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8709/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8709/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8709/console | This message was automatically generated. Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there's a lot of differences between increasing a container and allocating a container: - Increasing a container doesn't need to check locality delay. 
- Increasing a container doesn't need to build/modify a resource request tree (ANY->RACK->HOST). - Increasing a container doesn't need to check allocation/reservation starvation (see {{shouldAllocOrReserveNewContainer}}). - After increasing a container is approved by the scheduler, it needs to update an existing container token instead of creating a new container. And there are lots of similarities when allocating different types of resources. - User-limit/queue-limit will be enforced for both of them. - Both of them need resource reservation logic. (Maybe continuous reservation looking is needed for both of them). The purpose of this JIRA is to make it easier to extend the CapacityScheduler resource allocation logic to support different types of resource allocation, to make common code reusable, and to improve code organization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
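Purely as a hedged illustration of the refactoring direction described above, with every name hypothetical, the allocation steps could sit behind a small allocator abstraction that both normal allocation and container increase implement:

{code}
// Sketch only: a hypothetical seam so container allocation and container
// increase can share limit checks and reservation plumbing. All names are
// invented for illustration.
interface ContainerAllocator {
  boolean checkLimits(FiCaSchedulerApp app, Resource required);   // user/queue limits
  AllocationResult tryAllocateOrReserve(FiCaSchedulerApp app,
      FiCaSchedulerNode node);                                    // type-specific logic
  void commit(AllocationResult result);                           // new container or token update
}
{code}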
[jira] [Commented] (YARN-3994) RM should respect AM resource/placement constraints
[ https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646756#comment-14646756 ] Anubhav Dhoot commented on YARN-3994: - This jira should incorporate support for blacklisting done in this related jira RM should respect AM resource/placement constraints --- Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1643) Make ContainersMonitor can support change monitoring size of an allocated container in NM side
[ https://issues.apache.org/jira/browse/YARN-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646709#comment-14646709 ] MENG DING commented on YARN-1643: - bq. IIUC, the main benefit is that we don't need to synchronize on the enforceResourceLimits call, which can be heavy, right? If that is the case, we probably also need to have proper synchronization for ResourceCalculatorProcessTree, e.g., ProcfsBasedProcessTree/WindowsBasedProcessTree? These objects could be updated by multiple threads as well. I was afraid that the code change may be too much? I think I find a way without having to synchronize on {{ResourceCalculatorProcessTree}}. All that is needed for synchronization in this class is the trackingContainers map and the access to the vmemLimit/pmemLimit/cpuVcores fields. The actual resource limit enforcement can still be handled in the {{MonitoringThread.run}} thread only. Make ContainersMonitor can support change monitoring size of an allocated container in NM side -- Key: YARN-1643 URL: https://issues.apache.org/jira/browse/YARN-1643 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan Assignee: MENG DING Attachments: YARN-1643-YARN-1197.4.patch, YARN-1643-YARN-1197.5.patch, YARN-1643-YARN-1197.6.patch, YARN-1643.1.patch, YARN-1643.2.patch, YARN-1643.3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3974) Refactor the reservation system test cases to use parameterized base test
[ https://issues.apache.org/jira/browse/YARN-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-3974: - Attachment: YARN-3974-v4.patch Uploading an updated patch that fixes the checkstyle issue. Refactor the reservation system test cases to use parameterized base test - Key: YARN-3974 URL: https://issues.apache.org/jira/browse/YARN-3974 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-3974-v1.patch, YARN-3974-v2.patch, YARN-3974-v3.patch, YARN-3974-v4.patch We have two test suites for testing the ReservationSystem against the Capacity and Fair schedulers. We should combine them using a parameterized reservation system base test, similar to YARN-2797. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
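As a rough illustration of the parameterized base test being proposed (class and field names assumed, not taken from the patch):

{code}
// Sketch only: run each reservation-system test against both schedulers via
// JUnit's Parameterized runner. Names are illustrative.
@RunWith(Parameterized.class)
public class TestReservationSystemWithSchedulers {

  @Parameterized.Parameters
  public static Collection<Object[]> schedulers() {
    return Arrays.asList(new Object[][] {
        { CapacityScheduler.class }, { FairScheduler.class } });
  }

  @Parameterized.Parameter
  public Class<? extends ResourceScheduler> schedulerClass;

  // Individual tests would build the RM with schedulerClass and exercise the
  // ReservationSystem against it.
}
{code}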
[jira] [Assigned] (YARN-3994) RM should respect AM resource/placement constraints
[ https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-3994: --- Assignee: Varun Vasudev RM should respect AM resource/placement constraints --- Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Assignee: Varun Vasudev Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646510#comment-14646510 ] Rohith Sharma K S commented on YARN-3979: - Oops, 50 lakh events! I checked the attached logs; since you have attached only the ERROR logs, I was not able to trace it. One observation is that there are many InvalidStateTransition errors for the CLEAN_UP event in RMNodeImpl. # Would you possibly provide the RM logs? If you are not able to attach them to the JIRA, you could send them to me by mail. # Could you give more info, such as: what is the cluster size? How many apps are running? How many have completed? What is the state of the NodeManagers, i.e. are they running or in some other state? Which version of Hadoop are you using? Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
Zhijie Shen created YARN-3993: - Summary: Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3919: Priority: Trivial (was: Major) NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646559#comment-14646559 ] Sunil G commented on YARN-3993: --- OK. I understood it slightly differently. With the new changes from YARN-3116, we would like to determine whether a container is the AM based on its type (not by the existing way of using the container ID). If it's fine, I can look into this. Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Labels: newbie After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3993: -- Labels: newbie (was: ) Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Labels: newbie After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646558#comment-14646558 ] Zhijie Shen edited comment on YARN-3993 at 7/29/15 6:26 PM: [~sunilg], thanks for your interest. It's not related to RM. In YARN-3116, we already build the channel to propagate the AM flag to aux service. What we need to do here is simply update the way that PerNodeTimelineCollectorsAuxService determine if the container is AM or not. Feel free to pick it up if you want to ramp up with TS v2. was (Author: zjshen): [~sunilg], thanks for your interest. It's not related to RM. In YARN-3116, we already build the channel to propagate the AM flag to aux service. What we need to do here is simply update the way that PerNodeTimelineCollectorsAuxService determine if the container is AM or not. Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Labels: newbie After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
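For reference, a minimal sketch of what the updated check in the aux service could look like, assuming ContainerContext exposes the container type added by YARN-3116 (the accessor name getContainerType() and the class name below are assumptions, not the actual patch):
{code}
import org.apache.hadoop.yarn.server.api.ContainerContext;
import org.apache.hadoop.yarn.server.api.ContainerType;

public class AmFlagCheckSketch {
  boolean isApplicationMaster(ContainerContext context) {
    // Old convention: infer the AM from the container ID (first container of
    // the attempt). New approach: trust the flag propagated to aux services.
    return context.getContainerType() == ContainerType.APPLICATION_MASTER;
  }
}
{code}
With this in place, PerNodeTimelineCollectorsAuxService would no longer need to reason about container IDs at all.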
[jira] [Commented] (YARN-3994) RM should respect AM resource/placement constraints
[ https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646707#comment-14646707 ] Wangda Tan commented on YARN-3994: -- +1 for this, we should target this to 2.8. RM should respect AM resource/placement constraints --- Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646490#comment-14646490 ] Rohith Sharma K S commented on YARN-3250: - Adding to User API discussion, the ApplicationCLI command can be {{./yarn application appId --set-priority ApplicationId --priority value}} Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646540#comment-14646540 ] Hudson commented on YARN-3919: -- FAILURE: Integrated in Hadoop-trunk-Commit #8242 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8242/]) YARN-3919. NPEs' while stopping service after exception during CommonNodeLabelsManager#start. (varun saxena via rohithsharmaks) (rohithsharmaks: rev c020b62cf8de1f3baadc9d2f3410640ef7880543) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. 
Although this doesn't cause any functional issue, it is a nuisance, and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
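The fix described above amounts to null-guarding resources in the stop/close path, since stop can run after a failed start that never initialized them; a minimal sketch of the pattern (field names are illustrative, not the exact ones in FileSystemNodeLabelsStore or AsyncDispatcher):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;

public class NullSafeCloseSketch {
  private FSDataOutputStream editlogOs;
  private FileSystem fs;

  // close() may be invoked even though start() failed before these fields
  // were assigned, so guard each resource before touching it.
  public void close() throws IOException {
    if (editlogOs != null) {
      editlogOs.close();
    }
    if (fs != null) {
      fs.close();
    }
  }
}
{code}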
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646556#comment-14646556 ] Xuan Gong commented on YARN-3543: - Sorry for the late. The patch looks good overall. But we still made some un-necessary changes. * Changes made for RM side look good. * Changes on ATS side, I think that we could follow the changes from YARN-1462, which will include: ** ApplicationHistoryManagerOnTimelineStore ** TestApplicationHistoryManagerOnTimelineStore ** ApplicationMetricsConstants ** ApplicationCreatedEvent ** SystemMetricsPublisher ** TestSystemMetricsPublisher ** TimelineServer ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3994) RM should respect AM resource/placement constraints
[ https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3994: - Target Version/s: 2.8.0 RM should respect AM resource/placement constraints --- Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646676#comment-14646676 ] Jian He commented on YARN-3887: --- sounds good to me Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-3993: - Assignee: Sunil G Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Sunil G Labels: newbie After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646598#comment-14646598 ] Rohith Sharma K S commented on YARN-3543: - I got what you mean!! Right.. Modifying other files like *ApplicationStartData* and others are related to applicationhistoryservice I think. Is it so? ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3978) Configurably turn off the saving of container info in Generic AHS
[ https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646679#comment-14646679 ] Hadoop QA commented on YARN-3978: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 48s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 9m 21s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 12m 2s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 51s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 4m 58s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 31s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 0m 29s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 54m 33s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 105m 54s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-resourcemanager | | Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747634/YARN-3978.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c020b62 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8707/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8707/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8707/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8707/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8707/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8707/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8707/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8707/console | This message was automatically generated. 
Configurably turn off the saving of container info in Generic AHS - Key: YARN-3978 URL: https://issues.apache.org/jira/browse/YARN-3978 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver, yarn Affects Versions: 2.8.0, 2.7.1 Reporter: Eric Payne Assignee: Eric Payne Attachments: YARN-3978.001.patch, YARN-3978.002.patch Depending on how each application's metadata is stored, one week's worth of data stored in the Generic Application History Server's database can grow to be almost a terabyte of local disk space. In order to alleviate this, I suggest that there is a need for a configuration option to turn off saving of non-AM container metadata in the GAHS data store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646700#comment-14646700 ] Bikas Saha commented on YARN-2005: -- I am fine for opening a separate jira for the specific case I mentioned. Opened YARN-3994 for that. If you want you can extend its scope to blacklisting. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3994) RM should respect AM resource/placement constraints
Bikas Saha created YARN-3994: Summary: RM should respect AM resource/placement constraints Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646532#comment-14646532 ] Sunil G commented on YARN-3993: --- Hi [~zjshen] In SchedulerApplicationAttempt, we can use the RMContainer#isAMContainer() API to know that. It has been done this way as per YARN-2022. Could I take over this? Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3992: -- Attachment: 0001-YARN-3992.patch Thank you [~zjshen]. It seems we were not waiting for full containers to get allocated. I have now updated code so that we wait for all containers to get allocated. TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3992.patch {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
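The general shape of such a wait, sketched with placeholder helpers (this is not the actual TestApplicationPriority change; heartbeatNode() and countNewlyAllocatedContainers() are stand-ins for the MockNM/MockAM calls the test already makes):
{code}
// Poll the heartbeat/allocate cycle until every requested container has been
// handed out or a timeout hits, then assert, so the check no longer races
// with asynchronous allocation.
int allocated = 0;
long deadline = System.currentTimeMillis() + 10000L;
while (allocated < expectedContainers && System.currentTimeMillis() < deadline) {
  heartbeatNode();                                // placeholder: trigger one more scheduling cycle
  allocated += countNewlyAllocatedContainers();   // placeholder: containers from the allocate response
  Thread.sleep(100);
}
Assert.assertEquals(expectedContainers, allocated);
{code}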
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646612#comment-14646612 ] Anubhav Dhoot commented on YARN-2005: - I think blacklisting can have lots of policies and constraints and will probably change over time. Since RMAppAttemptImpl#ScheduleTransition drops the locality constraint it seems ok for the current blacklisting to also be locality constraint unaware. Should we start simple and keep a separate jira for honoring am locality in scheduling and blacklisting at the same time? [~jianhe],[~bikassaha] let me know if you agree and I can file that jira. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch, YARN-2005.003.patch, YARN-2005.004.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646675#comment-14646675 ] Hadoop QA commented on YARN-3992: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 6m 19s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 11s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 29s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 22s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 27s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 53m 7s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 71m 52s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747833/0001-YARN-3992.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / c020b62 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8708/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8708/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8708/console | This message was automatically generated. TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3992.patch {code} java.lang.AssertionError: expected:7 but was:5 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-3993: -- Affects Version/s: YARN-2928 Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Zhijie Shen Assignee: Sunil G Labels: newbie After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3983: - Attachment: YARN-3983.1.patch Attached initial patch for review. Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there's a lot of differences between increasing a container and allocating a container: - Increasing a container doesn't need to check locality delay. - Increasing a container doesn't need to build/modify a resource request tree (ANY-RACK/HOST). - Increasing a container doesn't need to check allocation/reservation starvation (see {{shouldAllocOrReserveNewContainer}}). - After increasing a container is approved by scheduler, it need to update an existing container token instead of creating new container. And there're lots of similarities when allocating different types of resources. - User-limit/queue-limit will be enforced for both of them. - Both of them needs resource reservation logic. (Maybe continuous reservation looking is needed for both of them). The purpose of this JIRA is to make easier extending CapacityScheduler resource allocation logic to support different types of resource allocation, make common code reusable, and also better code organization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
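As a rough illustration of the direction described above (none of the names below come from the attached patch), the shared limit checks could live in a base allocator while the listed per-type differences become subclass hooks:
{code}
import org.apache.hadoop.yarn.api.records.Resource;

abstract class AllocatorSketch {
  // Differs per allocation type: locality delay and request-tree bookkeeping
  // apply to new containers, but not to container increases.
  abstract boolean checkTypeSpecificConstraints();

  // Differs per allocation type: create a new container token vs. update an
  // existing container's token.
  abstract void commit();

  // Shared by both types: user-limit / queue-limit enforcement (reservation
  // handling would slot in here as well).
  final boolean tryAllocate(Resource requested, Resource userLimit, Resource queueLimit) {
    if (exceeds(requested, userLimit) || exceeds(requested, queueLimit)) {
      return false;
    }
    if (!checkTypeSpecificConstraints()) {
      return false;
    }
    commit();
    return true;
  }

  private boolean exceeds(Resource requested, Resource limit) {
    return requested.getMemory() > limit.getMemory()
        || requested.getVirtualCores() > limit.getVirtualCores();
  }
}
{code}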
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646478#comment-14646478 ] Rohith Sharma K S commented on YARN-3250: - Hi [~sunilg] As part of this JIRA, # User API : ## I am planning to introduce {{ApplicationClientProtocol#setPriority(SetApplicationPriorityRequest)}}. *SetApplicationPriorityRequest* comprises the ApplicationId and Priority. ClientRMService invokes the API introduced by YARN-3887, i.e. updateApplicationPriority(). ## Is getPriority required on the user side? I feel that, since ApplicationReport can give the priority of an application, this API is NOT required. What do you suggest, any thoughts? # Admin API : ## As admin, one should be able to change the *cluster-max-application-priority* value. Having an rmadmin API would be great!! One issue with this API is that cluster-max-application-priority is kept in memory: when rmadmin updates it, the in-memory value can be updated, but in HA/restart cases the configuration value is taken. So I suggest storing cluster-level-application-priority in the state store and, whenever the RM is switched/restarted, giving higher preference to the store. What do you think about this approach? Apart from the above APIs, should any new APIs be added? Kindly share your thoughts. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
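For illustration, a minimal sketch of the request shape proposed above, assuming the record follows the usual YARN abstract-record style (the actual class, its factory methods, and the protocol signature are still to be defined in this JIRA):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.Priority;

// The request simply carries the application id and the new priority;
// ClientRMService would forward it to the scheduler through the
// updateApplicationPriority() API added by YARN-3887.
public abstract class SetApplicationPriorityRequest {
  public abstract ApplicationId getApplicationId();
  public abstract void setApplicationId(ApplicationId applicationId);
  public abstract Priority getApplicationPriority();
  public abstract void setApplicationPriority(Priority priority);
}
{code}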
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646568#comment-14646568 ] Rohith Sharma K S commented on YARN-3543: - Thanks [~xgong] for the review. bq. But we still made some un-necessary changes. Sorry, I could not get what the unnecessary changes are. Could you please explain? ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3127) Avoid timeline events during RM recovery or restart
[ https://issues.apache.org/jira/browse/YARN-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646668#comment-14646668 ] Naganarasimha G R commented on YARN-3127: - Hi [~xgong], [~gtCarrera] [~ozawa], Can any one you have a look at this jira ? Avoid timeline events during RM recovery or restart --- Key: YARN-3127 URL: https://issues.apache.org/jira/browse/YARN-3127 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0, 2.7.1 Environment: RM HA with ATS Reporter: Bibin A Chundatt Assignee: Naganarasimha G R Priority: Critical Attachments: AppTransition.png, YARN-3127.20150213-1.patch, YARN-3127.20150329-1.patch, YARN-3127.20150624-1.patch 1.Start RM with HA and ATS configured and run some yarn applications 2.Once applications are finished sucessfully start timeline server 3.Now failover HA form active to standby 4.Access timeline server URL IP:PORT/applicationhistory //Note Earlier exception was thrown when accessed. Incomplete information is shown in the ATS web UI. i.e. attempt container and other information is not displayed. Also even if timeline server is started with RM, and on RM restart/ recovery ATS events for the applications already existing in ATS are resent which is not required. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646495#comment-14646495 ] Rohith Sharma K S commented on YARN-3250: - small correction in above syntax. Correct syntax is {{./yarn application --set-priority ApplicationId --priority value}} Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646518#comment-14646518 ] Sunil G commented on YARN-3250: --- Hi [~rohithsharma] Thank you for bringing up with api suggestions. I have few comments. bq.ApplicationClientProtocol#setPriority(SetApplicationProrityRequest) Could we use api name as {{setApplicationPriority}} bq. I suggest to store cluster-level-application-priority in store and whenever RM is switched/Restarted, give higher preference to store. I think this is a known design dilema we have in Yarn now. Once a centralized config tickets are done, we can have a clear solution. I am fine with having a priority given to RMStateStore over config file during restart. If there are no configuration changes, we can use value from yarn-site.xml. How will be the storage location path for this cluster-application-priority. I think we can group under cluster level so in future common other cluster configs can be placed if needed. bq.Apart from above API's , should there any new API's to be added? We can change default priority of a queue by changing capacity-scheduler.xml and call refreshQueues. I feel we may not need a command for that now. bq../yarn application -set-priority ApplicationId --priority value I feel we can have {{./yarn application --setPriority ApplicationId --priority value}} I was trying to sync with existing application commands {{-appStates}} {{-appTypes}} cc/[~jianhe] [~leftnoteasy] Please share your thoughts. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3993) Change to use the AM flag in ContainerContext determine AM container
[ https://issues.apache.org/jira/browse/YARN-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646558#comment-14646558 ] Zhijie Shen commented on YARN-3993: --- [~sunilg], thanks for your interest. It's not related to RM. In YARN-3116, we already build the channel to propagate the AM flag to aux service. What we need to do here is simply update the way that PerNodeTimelineCollectorsAuxService determine if the container is AM or not. Change to use the AM flag in ContainerContext determine AM container Key: YARN-3993 URL: https://issues.apache.org/jira/browse/YARN-3993 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen After YARN-3116, we will have a flag in ContainerContext to determine if the container is AM or not in aux service. We need to change accordingly to make use of this feature instead of depending on container ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646622#comment-14646622 ] Rohith Sharma K S commented on YARN-3543: - I have one doubt: whether it is able to render on the timeline web UI. I remember I made these changes so that the timeline web UI could fetch the data. Anyway, I will verify it tomorrow and confirm whether it is required. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3049) [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend
[ https://issues.apache.org/jira/browse/YARN-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646893#comment-14646893 ] Li Lu commented on YARN-3049: - Hi [~zjshen], some of my comments: - The addition of {{newApp}} is to indicate if we need to update the app2flow index table. This change is an interface change and it's slightly more than I thought. However, I am still inclined to proceed with the changes in this JIRA so that we can speed up consolidating our POC patches. - FileSystemTimelineReaderImpl, in {{fillFields}}, maybe we can use EnumSet.allOf() to generate the universe of fields so that we can reuse the logic of the following for loop for Field.ALL? - Reader interface: use TimelineCollectorContext to package reader arguments? - HBaseTimelineReaderImpl: l.160 (all line numbers are after patch) {code} byte[] row = result.getRow(); {code} unused? l.213 name of private method {{getEntity}}: I think we may want to distinguish it from the external {{getEntity}} API. How about parseEntity or getEntityFromResult? We're now performing filters by ourselves in memory. I'm wondering if it would be more efficient to translate some of our filter specifications into HBase filters? l.113, 136, 142: I'm a little bit worried about the {{0L}}s. Shall we have something like DEFAULT_TIME to make the argument list more readable? I assume the problem raised in l.369 (if the event comes with no info, it will be missed) will be addressed after YARN-3984? - HBaseTimelineWriterImpl: l.121-122: The log information is unclear about the write happening onto the App2Flow table. Also, we may want to keep this message at debug level? - TimelineSchemaCreator: Why are we not adding {{a2f}} as an option, similar to what we did in l.94-102 for {{e}} and {{m}}? - App2FlowColumn: l.51, {{private}} appears to be redundant in enums. Similarly in l.42 of App2FlowColumnFamily. nits: - Name of App2FlowTable: AppToFlowTable? Saving one character every time is not quite helpful... - l. 248, 263, 336: I'm confused by the name readConnections... - Add a specific test in TestHBaseTimelineWriterImpl for App2FlowTable? [Storage Implementation] Implement storage reader interface to fetch raw data from HBase backend Key: YARN-3049 URL: https://issues.apache.org/jira/browse/YARN-3049 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Zhijie Shen Attachments: YARN-3049-WIP.1.patch, YARN-3049-WIP.2.patch, YARN-3049-WIP.3.patch, YARN-3049-YARN-2928.2.patch Implement existing ATS queries with the new ATS reader design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
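On the EnumSet.allOf() suggestion, a tiny self-contained illustration of expanding Field.ALL once so the following for loop needs no special case (the Field enum is redeclared locally here; the real one lives in the timeline reader API):
{code}
import java.util.EnumSet;

public class FieldExpansionSketch {
  // Stand-in for the reader's Field enum.
  enum Field { ALL, EVENTS, INFO, METRICS, CONFIGS, RELATES_TO, IS_RELATED_TO }

  static EnumSet<Field> normalize(EnumSet<Field> requested) {
    if (requested == null || requested.contains(Field.ALL)) {
      // Expand ALL into the universe of fields; the caller's loop then treats
      // every request uniformly.
      return EnumSet.allOf(Field.class);
    }
    return requested;
  }
}
{code}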
[jira] [Commented] (YARN-3974) Refactor the reservation system test cases to use parameterized base test
[ https://issues.apache.org/jira/browse/YARN-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646895#comment-14646895 ] Hadoop QA commented on YARN-3974: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 23m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 10m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 16m 15s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 42s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 13s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 2m 13s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 46s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 3s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 54m 55s | Tests failed in hadoop-yarn-server-resourcemanager. | | | | 112m 41s | | \\ \\ || Reason || Tests || | Timed out tests | org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747854/YARN-3974-v4.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c020b62 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8710/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8710/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8710/console | This message was automatically generated. Refactor the reservation system test cases to use parameterized base test - Key: YARN-3974 URL: https://issues.apache.org/jira/browse/YARN-3974 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-3974-v1.patch, YARN-3974-v2.patch, YARN-3974-v3.patch, YARN-3974-v4.patch We have two test suites for testing ReservationSystem against Capacity Fair scheduler. We should combine them using a parametrized reservation system base test similar to YARN-2797 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646924#comment-14646924 ] Zhijie Shen commented on YARN-3984: --- [~vrushalic], thanks for picking it up. The aforementioned cases are definitely good to support, while the current query we want to support now (in YARN-3051 and YARN-3049) is to retrieve all events belonging to an entity (e.g. application, attempt, container and etc.). With this basic query, we can easily distill the details that happen to the entity, such as the diagnostic msg of the kill event. In this case, the most efficient way is to put timestamp even before the event ID, so that we don't need to order the events in memory. In addition to the key composition, I find another significant problem with the event store schema. If the event doesn't contain any info, it will be ignored then. And we cannot always guarantee user will put something into info. For example, user may define a KILL event without any diagnostic msg. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
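To make the ordering argument concrete, a toy sketch of the two qualifier layouts (the separator and the fixed-width inverted-timestamp encoding are illustrative; the real schema uses the timeline storage's own separators and encoders):
{code}
public class EventKeySketch {
  private static final String SEP = "!";  // illustrative separator

  // Current layout: event_id first. Columns of one event sit together, but a
  // chronological view of the entity's events needs re-sorting in memory.
  static String currentKey(String eventId, String infoKey, long ts) {
    return eventId + SEP + infoKey + SEP + invert(ts);
  }

  // Proposed layout: timestamp first. A plain scan of the entity's event
  // columns already comes back newest-first, so "all events of an entity"
  // needs no in-memory sort.
  static String proposedKey(String eventId, String infoKey, long ts) {
    return invert(ts) + SEP + eventId + SEP + infoKey;
  }

  // Fixed-width inverted timestamp so lexicographic order matches reverse
  // chronological order.
  private static String invert(long ts) {
    return String.format("%019d", Long.MAX_VALUE - ts);
  }
}
{code}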
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646930#comment-14646930 ] Sangjin Lee commented on YARN-3984: --- {quote} In addition to the key composition, I find another significant problem with the event store schema. If the event doesn't contain any info, it will be ignored then. And we cannot always guarantee user will put something into info. For example, user may define a KILL event without any diagnostic msg. {quote} Thanks for spotting that issue [~zjshen]. That's definitely a huge issue. We should address that as part of this JIRA... Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3904: Attachment: YARN-3904-YARN-2928.005.patch Refreshed my patch according to [~sjlee0]'s comments. Specifically, I set up a new interface (OfflineAggregationWriter) for aggregation writers. With this new interface I decoupled PhoenixOfflineAggregationWriter from TimelineWriter. Having a separate offline writer interface also gives us more freedom to design the aggregation storage interface. Now in the new writer API the type of the offline aggregation is specified by the incoming {{OfflineAggregationInfo}}. I also considered to combine reader and writer interfaces into a OfflineAggregationStorage interface, but it turned out that we may have some reader-only implementations (such as reading app level aggregations from HBase). Separating offline readers and writers will give us more freedom in this case. Refactor timelineservice.storage to add support to online and offline aggregation writers - Key: YARN-3904 URL: https://issues.apache.org/jira/browse/YARN-3904 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch After we finished the design for time-based aggregation, we can adopt our existing Phoenix storage into the storage of the aggregated data. In this JIRA, I'm proposing to refactor writers to add support to aggregation writers. Offline aggregation writers typically has less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
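A rough sketch of the separation described above, with all names illustrative rather than taken from the attached patch:
{code}
import java.io.IOException;
import org.apache.hadoop.service.AbstractService;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntities;

// Offline aggregation writers get their own base type instead of implementing
// TimelineWriter; the aggregation granularity travels with each call instead
// of being baked into the writer.
abstract class OfflineAggregationWriterSketch extends AbstractService {
  OfflineAggregationWriterSketch(String name) {
    super(name);
  }

  // Placeholder for the OfflineAggregationInfo mentioned above: it would name
  // the target table and granularity (e.g. flow-level vs. user-level).
  static final class AggregationInfoSketch {
    final String tableName;
    AggregationInfoSketch(String tableName) {
      this.tableName = tableName;
    }
  }

  abstract void writeAggregatedEntities(AggregationInfoSketch info,
      TimelineEntities entities) throws IOException;
}
{code}
Keeping the reader side separate, as the comment notes, also leaves room for reader-only implementations such as reading app-level aggregations straight from HBase.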
[jira] [Commented] (YARN-3814) REST API implementation for getting raw entities in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647023#comment-14647023 ] Hadoop QA commented on YARN-3814: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747894/YARN-3814.reference.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / ddc867ce | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8712/console | This message was automatically generated. REST API implementation for getting raw entities in TimelineReader -- Key: YARN-3814 URL: https://issues.apache.org/jira/browse/YARN-3814 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-3814-YARN-2928.01.patch, YARN-3814-YARN-2928.02.patch, YARN-3814.reference.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3994) RM should respect AM resource/placement constraints
[ https://issues.apache.org/jira/browse/YARN-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-3994: Assignee: (was: Varun Vasudev) RM should respect AM resource/placement constraints --- Key: YARN-3994 URL: https://issues.apache.org/jira/browse/YARN-3994 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Today, locality and cpu for the AM can be specified in the AM launch container request but are ignored at the RM. Locality is assumed to be ANY and cpu is dropped. There may be other things too that are ignored. This should be fixed so that the user gets what is specified in their code to launch the AM. cc [~leftnoteasy] [~vvasudev] [~adhoot] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3984) Rethink event column key issue
[ https://issues.apache.org/jira/browse/YARN-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646935#comment-14646935 ] Zhijie Shen commented on YARN-3984: --- In fact, metric has the same problem, but it may be still okay to ignore a metric without any data. Rethink event column key issue -- Key: YARN-3984 URL: https://issues.apache.org/jira/browse/YARN-3984 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Vrushali C Fix For: YARN-2928 Currently, the event column key is event_id?info_key?timestamp, which is not so friendly to fetching all the events of an entity and sorting them in a chronologic order. IMHO, timestamp?event_id?info_key may be a better key schema. I open this jira to continue the discussion about it which was commented on YARN-3908. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646975#comment-14646975 ] Wangda Tan commented on YARN-3945: -- And forgot to mention, maxApplicationsPerUser computation is a byproduct of user-limit, I would like to see if we can reach some consent about change/not-change user-limit before fixing maxApplicationPerUser based on existing user-limit assumptions. maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The the former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer. {quote} configuration related to minimum limit should not be made used in a formula to calculate max applications for a user -- This message was sent by Atlassian JIRA (v6.3.4#6332)
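Plugging illustrative numbers into the current formula shows the mismatch the JIRA calls out: the minimum-share percentage caps a user's applications even though a lone user may be allowed to occupy the whole queue.
{code}
// Illustrative numbers only.
int maxApplications = 10000;   // maximum applications allowed in the queue
int userLimit = 25;            // minimum-user-limit-percent, i.e. the 25% minimum guarantee
float userLimitFactor = 1f;

int maxApplicationsPerUser =
    (int) (maxApplications * (userLimit / 100.0f) * userLimitFactor);
// maxApplicationsPerUser == 2500, even though a single active user may be
// entitled to use up to 100% of the queue's resources.
{code}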
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647017#comment-14647017 ] Naganarasimha G R commented on YARN-3995: - Two approaches have been discussed so far: # We can have a timer task which cleans up the collector after some grace period, rather than removing it immediately when the AM container finishes. # When the RM finishes the attempt, it can send one finish event through the timelineclient for the ApplicationEntity, which is a kind of marker that the NM's TimelineCollectorManager can act upon. Some of the NM events are not getting published due race condition when AM container finishes in NM Key: YARN-3995 URL: https://issues.apache.org/jira/browse/YARN-3995 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, timelineserver Affects Versions: YARN-2928 Reporter: Naganarasimha G R Assignee: Naganarasimha G R As discussed in YARN-3045: while testing with TestDistributedShell it was found that a few of the container metrics events were failing because of a race condition. When the AM container finishes and the collector for the app is removed, there is still a possibility that events published for the app by the current NM and other NMs are still in the pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
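A minimal sketch of the first approach, with hypothetical names and an assumed grace period: instead of removing the app's collector as soon as the AM container finishes, the removal is scheduled after a delay so in-flight events from this NM and other NMs still find a live collector.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class CollectorCleanupSketch {
  private static final long GRACE_PERIOD_SECONDS = 60; // assumed; would be configurable
  private final ScheduledExecutorService cleaner =
      Executors.newSingleThreadScheduledExecutor();

  void onAmContainerFinished(final String appId) {
    // delay the removal instead of doing it immediately
    cleaner.schedule(new Runnable() {
      @Override
      public void run() {
        removeCollector(appId);
      }
    }, GRACE_PERIOD_SECONDS, TimeUnit.SECONDS);
  }

  void removeCollector(String appId) {
    System.out.println("removing collector for " + appId);
  }
}
{code}
The second approach would instead key the cleanup off an explicit finish marker from the RM, which avoids having to guess a delay.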
[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646964#comment-14646964 ] Wangda Tan commented on YARN-3945: -- Thanks for summarizing [~Naganarasimha]. I think we *might* need to reconsider the user-limit / user-limit-factor configuration. I can also see that it is hard to understand: - User-limit is neither a lower bound nor an upper bound. - User-limit is not a fairness mechanism to balance resources between users; instead, it can lead to bad imbalance. For example, if we set user-limit = 50 and there are 10 users running, we cannot control how much resource each user can use. - It's really hard to understand: I work on CapacityScheduler almost every day, but sometimes I forget and need to look at the code to see how it is computed. :-( Basically user-limit is computed by {{user-limit = min(queue-capacity * user-limit-factor, current-capacity * max(user-limit / 100, 1 / #active-users))}}. But this formula is not that meaningful since #active-users changes every minute; it is not a predictable formula. Instead we may need to consider some notion like fair sharing: user-limit-factor becomes the max-resource-limit of each user, and user-limit-percentage becomes something like guaranteed-concurrent-#users; when #users exceeds guaranteed-concurrent-#users, the remaining users can only get idle shares. With this approach, and considering we have user-limit preemption within a queue (YARN-2113), we can get a predictable user-limit. Thoughts? [~nroberts], [~jlowe]. maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The the former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer. {quote} configuration related to minimum limit should not be made used in a formula to calculate max applications for a user -- This message was sent by Atlassian JIRA (v6.3.4#6332)
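To make the unpredictability concrete, here is the formula quoted above evaluated with made-up numbers (queue and current capacity of 100 units, user-limit of 50%, user-limit-factor of 1, 10 active users). Each user's limit comes out to 50, so two busy users can legitimately occupy the whole queue while the other eight get nothing, which is exactly the imbalance described in the comment.
{code}
public class UserLimitExample {
  public static void main(String[] args) {
    double queueCapacity = 100;     // assumed units
    double currentCapacity = 100;   // assumed units
    double userLimitFactor = 1;
    double userLimitPercent = 50;
    int activeUsers = 10;
    double userLimit = Math.min(queueCapacity * userLimitFactor,
        currentCapacity * Math.max(userLimitPercent / 100.0, 1.0 / activeUsers));
    System.out.println(userLimit); // prints 50.0: any single user may take half the queue
  }
}
{code}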
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646992#comment-14646992 ] Naganarasimha G R commented on YARN-3045: - Thanks for the comments [~djp], bq. We already have a new flush() API now for the writer that was checked in with YARN-3949... You are right that we are lacking an API to respect this priority/policy in the whole data flow for writing. I will file another JIRA to track this. I went through the discussions and the patch of YARN-3949. I feel calling two APIs would not be very user friendly, and how will the users of TimelineClient call flush? I think that is not captured in YARN-3949. bq. Anyway, I would support the scope (container events + foundation work) you proposed here in case you are comfortable with it. I am fine with a single jira, but the only trouble is that as the scope increases there will be more delay in the jira, as more discussion will be required (in this case, which entity to publish NM app localization events under). Also, since I have been holding this jira for a long time, I thought of getting the basic part out and developing on top of it. I am ok if you want to avoid multiple jiras. bq. That's a good question. My initial thinking is we could need something like a NodemanagerEntity to store application events, resource localization events, log aggregation handling events, configuration, etc. However, I would like to hear your and other guys' ideas on this as well. We had a discussion on this topic today in the meeting and [~sjlee0] was of the opinion not to have another entity here. I think we need more discussion on this, as it involves querying too. The approach I can think of is: * Application-level events in the NM can go under the ApplicationEntity, and the EventID can have the event type (INIT_APPLICATION/APPLICATION_FINISHED/APPLICATION_LOG_HANDLING_FAILED) and the NM_ID. * For localization, I feel it can go under the ContainerEntity, and the EventID can have the event type (REQUEST, LOCALIZED, LOCALIZATION_FAILED) and the PATH of the localized resource. bq. IMO, the 2nd approach (hook into the existing event dispatcher) looks simpler and straightforward. This approach is straightforward, but I am not sure whether it might have an impact (just initial apprehensions). I will start implementing it for container events and share an initial patch based on this approach. [Event producers] Implement NM writing container lifecycle events to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3045-YARN-2928.002.patch, YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, YARN-3045-YARN-2928.005.patch, YARN-3045-YARN-2928.006.patch, YARN-3045.20150420-1.patch Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
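One way to encode the EventID proposal in the comment above is sketched here; the separator and helper names are assumptions for illustration, not part of any patch.
{code}
class NmEventIdSketch {
  private static final String SEP = "!"; // assumed separator

  // application-level NM events under the ApplicationEntity: event type plus NM id
  static String appEventId(String eventType, String nmId) {
    return eventType + SEP + nmId;            // e.g. APPLICATION_FINISHED!host:port
  }

  // localization events under the ContainerEntity: event type plus localized path
  static String localizationEventId(String eventType, String resourcePath) {
    return eventType + SEP + resourcePath;    // e.g. LOCALIZED!<path-to-resource>
  }
}
{code}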
[jira] [Created] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305
Anubhav Dhoot created YARN-3996: --- Summary: YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305 Key: YARN-3996 URL: https://issues.apache.org/jira/browse/YARN-3996 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest with minimumResource for the incrementResource. This causes normalize to return zero if the minimum is set to zero as per YARN-789. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
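A simplified illustration of why this breaks zero capabilities (this is not the real Resources/ResourceCalculator code; the helper and values are assumptions): normalization rounds a request up to a multiple of the increment, so if the zero minimum resource is passed in as the increment, the rounded value collapses to zero and the request is effectively wiped out.
{code}
class NormalizeSketch {
  static int roundUp(int value, int step) {
    if (step == 0) {
      return 0; // degenerate case hit when the zero minimum is used as the increment
    }
    return ((value + step - 1) / step) * step;
  }

  public static void main(String[] args) {
    System.out.println(roundUp(1000, 512)); // 1024: normal rounding to the increment
    System.out.println(roundUp(1000, 0));   // 0: a zero "increment" zeroes the request
  }
}
{code}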
[jira] [Commented] (YARN-3920) FairScheduler Reserving a node for a container should be configurable to allow it used only for large containers
[ https://issues.apache.org/jira/browse/YARN-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647045#comment-14647045 ] Arun Suresh commented on YARN-3920: --- Thanks for the patch [~adhoot]. The patch looks pretty straightforward to me and the test case looks good. My only minor comment is: maybe we can expose this as an absolute value rather than a ratio, and the {{isReservable()}} function would just take min(ReservationThreshold, MaxCapability). I am ok either way though. +1 pending the above decision. FairScheduler Reserving a node for a container should be configurable to allow it used only for large containers Key: YARN-3920 URL: https://issues.apache.org/jira/browse/YARN-3920 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: yARN-3920.001.patch Reserving a node for a container was designed to prevent large containers from being starved by small requests that keep getting onto a node. Today we let this be used even for a small container request. This has a huge impact on scheduling since we block other scheduling requests until that reservation is fulfilled. We should make this configurable so its impact can be minimized by limiting it to large container requests as originally intended. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
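A sketch of the alternative Arun suggests, with hypothetical names and memory-only resources for brevity: the configured threshold is an absolute value, capped at the scheduler's maximum allocation, and a node is only reserved for requests at least that large.
{code}
class ReservationThresholdSketch {
  // requestMb: size of the container request; thresholdMb: configured absolute
  // threshold; maxAllocationMb: the scheduler's maximum allocation
  static boolean isReservable(long requestMb, long thresholdMb, long maxAllocationMb) {
    long effectiveThreshold = Math.min(thresholdMb, maxAllocationMb);
    return requestMb >= effectiveThreshold;
  }
}
{code}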
[jira] [Commented] (YARN-3974) Refactor the reservation system test cases to use parameterized base test
[ https://issues.apache.org/jira/browse/YARN-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647070#comment-14647070 ] Anubhav Dhoot commented on YARN-3974: - LGTM Refactor the reservation system test cases to use parameterized base test - Key: YARN-3974 URL: https://issues.apache.org/jira/browse/YARN-3974 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-3974-v1.patch, YARN-3974-v2.patch, YARN-3974-v3.patch, YARN-3974-v4.patch We have two test suites for testing ReservationSystem against the Capacity and Fair schedulers. We should combine them using a parametrized reservation system base test, similar to YARN-2797. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
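For readers unfamiliar with the pattern, here is a hedged illustration of what a parameterized base test looks like in JUnit 4; the class name and scheduler handling are made up and this is not the YARN-3974 patch, it only shows how the same tests run once per scheduler parameter.
{code}
import java.util.Arrays;
import java.util.Collection;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class ReservationSystemBaseTestSketch {

  @Parameters
  public static Collection<Object[]> schedulers() {
    return Arrays.asList(new Object[][] { {"CapacityScheduler"}, {"FairScheduler"} });
  }

  private final String schedulerName;

  public ReservationSystemBaseTestSketch(String schedulerName) {
    this.schedulerName = schedulerName;
  }

  @Test
  public void testReservationSystemInitializes() {
    // a real base test would bring up an RM configured with the given scheduler here
    System.out.println("running reservation system tests against " + schedulerName);
  }
}
{code}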
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647095#comment-14647095 ] zhangyubiao commented on YARN-3979: --- The cluster has about 1600 nodes, with about 550 apps running and 2 lakh (200,000) apps completed. At one point all the NodeManagers were lost and then recovered a moment later. I use Hadoop-2.2.0 on CentOS 6.5. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647132#comment-14647132 ] Li Lu commented on YARN-3904: - The two failed tests passed on my local machine, and the failures appeared to be unrelated. That said, we may still need to fix those intermittent test failures. Refactor timelineservice.storage to add support to online and offline aggregation writers - Key: YARN-3904 URL: https://issues.apache.org/jira/browse/YARN-3904 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch After we finished the design for time-based aggregation, we can adopt our existing Phoenix storage into the storage of the aggregated data. In this JIRA, I'm proposing to refactor writers to add support to aggregation writers. Offline aggregation writers typically has less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647130#comment-14647130 ] zhangyubiao commented on YARN-3979: --- I have sent you an email with the RM jstack log, and I will send you the app log soon. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647163#comment-14647163 ] Sangjin Lee commented on YARN-3045: --- bq. I went through the discussions and the patch of YARN-3949. I feel calling two APIs would not be very user friendly, and how will the users of TimelineClient call flush? I think that is not captured in YARN-3949. We did discuss it in that JIRA. See [this comment|https://issues.apache.org/jira/browse/YARN-3949?focusedCommentId=14640959page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640959] for instance. Note that the user of those two methods is really {{TimelineCollector}}. I don't think we'd be exposing {{flush()}} to {{TimelineClient}}. The synchronous nature of the writes would be expressed differently on {{TimelineClient}}. bq. We had a discussion on this topic today in the meeting and Sangjin Lee was of the opinion not to have another entity here. I think we need more discussion on this, as it involves querying too. To elaborate a little further, creating a new entity type just to capture different origins of application events seems a bit too much. These are really events that belong to YARN applications, and I don't see why they shouldn't be part of the YARN application entities. It also simplifies the query model. When you query for a YARN application entity, you get all application events, regardless of whether they originate from the RM or NMs. That's a much nicer interaction when querying for a YARN app. [Event producers] Implement NM writing container lifecycle events to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3045-YARN-2928.002.patch, YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, YARN-3045-YARN-2928.005.patch, YARN-3045-YARN-2928.006.patch, YARN-3045.20150420-1.patch Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2884) Proxying all AM-RM communications
[ https://issues.apache.org/jira/browse/YARN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647075#comment-14647075 ] Hadoop QA commented on YARN-2884: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 21m 7s | Pre-patch trunk has 6 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 6 new or modified test files. | | {color:green}+1{color} | javac | 8m 25s | There were no new javac warning messages. | | {color:red}-1{color} | javadoc | 11m 29s | The applied patch generated 4 additional warning messages. | | {color:green}+1{color} | release audit | 0m 27s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 45s | The applied patch generated 1 new checkstyle issues (total was 237, now 237). | | {color:green}+1{color} | whitespace | 0m 2s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 49s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 42s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 8m 30s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 10s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-server-common. | | {color:red}-1{color} | yarn tests | 6m 33s | Tests failed in hadoop-yarn-server-nodemanager. | | {color:green}+1{color} | yarn tests | 55m 5s | Tests passed in hadoop-yarn-server-resourcemanager. 
| | | | 121m 34s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestDeletionService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747878/YARN-2884-V6.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c020b62 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-common.html | | javadoc | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/diffJavadocWarnings.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8711/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8711/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8711/console | This message was automatically generated. Proxying all AM-RM communications - Key: YARN-2884 URL: https://issues.apache.org/jira/browse/YARN-2884 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Carlo Curino Assignee: Kishore Chaliparambil Attachments: YARN-2884-V1.patch, YARN-2884-V2.patch, YARN-2884-V3.patch, YARN-2884-V4.patch, YARN-2884-V5.patch, YARN-2884-V6.patch We introduce the notion of an RMProxy, running on each node (or once per rack). Upon start the AM is forced (via tokens and configuration) to direct all its requests to a new services running on the NM that provide a proxy to the central RM. This give us a place to: 1) perform distributed scheduling decisions 2) throttling mis-behaving AMs 3) mask the access to a federation of RMs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647177#comment-14647177 ] Rohith Sharma K S commented on YARN-3979: - Thanks for the information!! bq. At one point all the NodeManagers were lost and then recovered a moment later I can think of a scenario very close to YARN-3990. Since you have 2 lakh apps completed and 1600 NodeManagers, when all the nodes were lost and reconnected, the number of events generated is {{(2 lakh completed + 550 running = 200550) * 1600 (number of NodeManagers) = 320,880,000}} events.. Oops!!! Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3904) Refactor timelineservice.storage to add support to online and offline aggregation writers
[ https://issues.apache.org/jira/browse/YARN-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647074#comment-14647074 ] Hadoop QA commented on YARN-3904: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 36s | Findbugs (version ) appears to be broken on YARN-2928. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 42s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 14s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 27s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 39s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 0m 46s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 7m 58s | Tests failed in hadoop-yarn-server-timelineservice. | | | | 44m 48s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.timelineservice.storage.TestHBaseTimelineWriterImpl | | | hadoop.yarn.server.timelineservice.storage.TestPhoenixOfflineAggregationWriterImpl | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747900/YARN-3904-YARN-2928.005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | YARN-2928 / df0ec47 | | hadoop-yarn-server-timelineservice test log | https://builds.apache.org/job/PreCommit-YARN-Build/8713/artifact/patchprocess/testrun_hadoop-yarn-server-timelineservice.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8713/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8713/console | This message was automatically generated. Refactor timelineservice.storage to add support to online and offline aggregation writers - Key: YARN-3904 URL: https://issues.apache.org/jira/browse/YARN-3904 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Li Lu Assignee: Li Lu Attachments: YARN-3904-YARN-2928.001.patch, YARN-3904-YARN-2928.002.patch, YARN-3904-YARN-2928.003.patch, YARN-3904-YARN-2928.004.patch, YARN-3904-YARN-2928.005.patch After we finished the design for time-based aggregation, we can adopt our existing Phoenix storage into the storage of the aggregated data. In this JIRA, I'm proposing to refactor writers to add support to aggregation writers. Offline aggregation writers typically has less contextual information. We can distinguish these writers by special naming. We can also use CollectorContexts to model all contextual information and use it in our writer interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3983) Make CapacityScheduler to easier extend application allocation logic
[ https://issues.apache.org/jira/browse/YARN-3983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647150#comment-14647150 ] Jian He commented on YARN-3983: --- Thanks Wangda! Some comments on the patch: - ApplicationResourceAllocator -> ContainerAllocator - NewContainerAllocator -> RegularContainerAllocator - internalPreAllocation -> preAllocate - move the assignContainersOnNode into internalApplyAllocation - internalApplyAllocation -> doAllocation - doAllocation -> allocate - AllocatorAllocationResult -> ContainerAllocation - SKIPPED_APP -> SKIP_APP; similarly for others - this.resourceToBeAllocated can be set null; the caller can check whether null or not {code} if (resourceToBeAllocated == null) { this.resourceToBeAllocated = Resources.none(); } else { this.resourceToBeAllocated = resourceToBeAllocated; } {code} - AllocatorAllocationResult#allocateNodeType -> AllocatorAllocationResult#containerNodeType - Fix FiCaSchedulerApp#assignContainer method format and remove the unused createdContainer parameter - handleNewContainerReservation does not need to be a separate method; - getCSAssignmentFromAllocateResult can be part of doAllocation. Make CapacityScheduler to easier extend application allocation logic Key: YARN-3983 URL: https://issues.apache.org/jira/browse/YARN-3983 Project: Hadoop YARN Issue Type: Bug Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3983.1.patch While working on YARN-1651 (resource allocation for increasing container), I found it is very hard to extend existing CapacityScheduler resource allocation logic to support different types of resource allocation. For example, there's a lot of differences between increasing a container and allocating a container: - Increasing a container doesn't need to check locality delay. - Increasing a container doesn't need to build/modify a resource request tree (ANY-RACK/HOST). - Increasing a container doesn't need to check allocation/reservation starvation (see {{shouldAllocOrReserveNewContainer}}). - After increasing a container is approved by scheduler, it need to update an existing container token instead of creating new container. And there're lots of similarities when allocating different types of resources. - User-limit/queue-limit will be enforced for both of them. - Both of them needs resource reservation logic. (Maybe continuous reservation looking is needed for both of them). The purpose of this JIRA is to make easier extending CapacityScheduler resource allocation logic to support different types of resource allocation, make common code reusable, and also better code organization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected
Rohith Sharma K S created YARN-3990: --- Summary: AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Priority: Critical Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. Additional check for wheather app is finished/killed/failed would minimizes the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + reported unusable); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + reported usable); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error(Ignoring invalid eventtype + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
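A minimal sketch of the check suggested in the description, using the names from the snippet above plus {{RMApp#getState}} and {{RMAppState}} (shown for the NODE_USABLE branch only; treat it as an illustration of the idea, not the attached patch): completed applications are skipped, so a node reconnect no longer floods the dispatcher with updates nobody consumes.
{code}
for (RMApp app : rmContext.getRMApps().values()) {
  RMAppState state = app.getState();
  if (state == RMAppState.FINISHED || state == RMAppState.FAILED
      || state == RMAppState.KILLED) {
    continue; // finished/failed/killed apps do not need node update events
  }
  rmContext.getDispatcher().getEventHandler().handle(
      new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
          RMAppNodeUpdateType.NODE_USABLE));
}
{code}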
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645795#comment-14645795 ] Bibin A Chundatt commented on YARN-3990: [~rohithsharma] Yes,Currently i have submitted about 50K+ apps and {{yarn.resourcemanager.max-completed-applications}} is set to 20K. {{yarn.resourcemanager.state-store.max-completed-applications}} default ={{yarn.resourcemanager.max-completed-applications}} AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. Additional check for wheather app is finished/killed/failed would minimizes the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + reported unusable); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + reported usable); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error(Ignoring invalid eventtype + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645601#comment-14645601 ] zhangyubiao commented on YARN-3979: --- Thank you for reply @Rohith Sharma K S Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangyubiao updated YARN-3979: -- Attachment: ERROR103.log The RM log is very large, so I grepped for ERROR in the logs. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3990: Summary: AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected (was: AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected ) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. Additional check for wheather app is finished/killed/failed would minimizes the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + reported unusable); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + reported usable); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error(Ignoring invalid eventtype + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt reassigned YARN-3990: -- Assignee: Bibin A Chundatt AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. Additional check for wheather app is finished/killed/failed would minimizes the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + reported unusable); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + reported usable); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error(Ignoring invalid eventtype + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-3990: --- Attachment: 0001-YARN-3990.patch [~rohithsharma] Attaching patch for initial review AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. Additional check for wheather app is finished/killed/failed would minimizes the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + reported unusable); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + reported usable); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error(Ignoring invalid eventtype + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645535#comment-14645535 ] Rohith Sharma K S commented on YARN-3979: - How many applications have completed? How many applications are running? How many NMs are running? When does the event queue become full? Any observations you have made? Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645567#comment-14645567 ] Bibin A Chundatt commented on YARN-3990: [~rohithsharma] {code} 2015-07-29 19:39:03,409 | INFO | ResourceManager Event Processor | Added node host-7:26009 clusterResource: memory:178400, vCores:64 | CapacityScheduler.java:1358 2015-07-29 19:39:03,409 | INFO | AsyncDispatcher event handler | Size of event-queue is 3000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,409 | DEBUG | Socket Reader #1 for port 26003 | got #2125 | Server.java:1790 2015-07-29 19:39:03,409 | DEBUG | IPC Server handler 7 on 26003 | IPC Server handler 7 on 26003: org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 172.168.100.7:24999 Call#2125 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER | Server.java:2058 2015-07-29 19:39:03,410 | DEBUG | IPC Server handler 7 on 26003 | PrivilegedAction as:mapred/hadoop.hadoop@hadoop.com (auth:KERBEROS) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2082) | UserGroupInformation.java:1696 2015-07-29 19:39:03,410 | INFO | AsyncDispatcher event handler | Size of event-queue is 4000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,410 | INFO | AsyncDispatcher event handler | Size of event-queue is 5000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,411 | INFO | AsyncDispatcher event handler | Size of event-queue is 6000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,412 | INFO | AsyncDispatcher event handler | Size of event-queue is 7000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,412 | INFO | IPC Server handler 7 on 26003 | Size of event-queue is 7000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,412 | INFO | AsyncDispatcher event handler | Size of event-queue is 8000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,413 | INFO | AsyncDispatcher event handler | Size of event-queue is 9000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,414 | INFO | AsyncDispatcher event handler | Size of event-queue is 1 | AsyncDispatcher.java:235 2015-07-29 19:39:03,414 | INFO | AsyncDispatcher event handler | Size of event-queue is 11000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,415 | DEBUG | IPC Server handler 7 on 26003 | Served: nodeHeartbeat queueTime= 1 procesingTime= 5 | ProtobufRpcEngine.java:631 2015-07-29 19:39:03,415 | INFO | AsyncDispatcher event handler | Size of event-queue is 12000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,416 | DEBUG | IPC Server handler 7 on 26003 | Adding saslServer wrapped token of size 100 as call response. | Server.java:2460 2015-07-29 19:39:03,416 | DEBUG | IPC Server handler 7 on 26003 | IPC Server handler 7 on 26003: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 172.168.100.7:24999 Call#2125 Retry#0 | Server.java:994 2015-07-29 19:39:03,416 | INFO | AsyncDispatcher event handler | Size of event-queue is 13000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,416 | DEBUG | IPC Server handler 7 on 26003 | IPC Server handler 7 on 26003: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 172.168.100.7:24999 Call#2125 Retry#0 Wrote 118 bytes. 
| Server.java:1013 2015-07-29 19:39:03,416 | INFO | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,417 | INFO | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,418 | INFO | AsyncDispatcher event handler | Size of event-queue is 16000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,419 | INFO | AsyncDispatcher event handler | Size of event-queue is 17000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,419 | INFO | AsyncDispatcher event handler | Size of event-queue is 18000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,420 | INFO | AsyncDispatcher event handler | Size of event-queue is 19000 | AsyncDispatcher.java:235 2015-07-29 19:39:03,421 | INFO | AsyncDispatcher event handler | Size of event-queue is 2 | AsyncDispatcher.java:235 2015-07-29 19:39:03,421 | DEBUG | AsyncDispatcher event handler | Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppNodeUpdateEvent.EventType: NODE_UPDATE | AsyncDispatcher.java:166 2015-07-29 19:39:03,421 | DEBUG | AsyncDispatcher event handler | Processing event for application_1438101193238_224125 of type NODE_UPDATE | RMAppImpl.java:741 2015-07-29 19:39:03,421 | DEBUG | AsyncDispatcher event handler | Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppNodeUpdateEvent.EventType: NODE_UPDATE | AsyncDispatcher.java:166 2015-07-29 19:39:03,421 | DEBUG | AsyncDispatcher event handler | Processing event for application_1438101193238_224126 of type NODE_UPDATE | RMAppImpl.java:741 2015-07-29 19:39:03,422 | DEBUG | AsyncDispatcher event handler |
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645566#comment-14645566 ] Rohith Sharma K S commented on YARN-3887: - Your understanding is correct. I meant to say we could have a new synchronous API like {{updateApplicationStateSynchronizly}} in RMStateStore. [~jianhe] what do you think about having a new synchronous API in RMStateStore? Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645583#comment-14645583 ] Rohith Sharma K S commented on YARN-3990: - thanks [~bibinchundatt] for reproducing the issue. I believe in you clustesr appsCompleted/appsRunning are 2 and max number of completed apps to keep is set to 20k? AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Whenever node is added or removed, NodeListManager sends RMAppNodeUpdateEvent to all the applications that are in the rmcontext. But for finished/killed/failed applications it is not required to send these events. Additional check for wheather app is finished/killed/failed would minimizes the unnecessary events {code} public void handle(NodesListManagerEvent event) { RMNode eventNode = event.getNode(); switch (event.getType()) { case NODE_UNUSABLE: LOG.debug(eventNode + reported unusable); unusableRMNodesConcurrentSet.add(eventNode); for(RMApp app: rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_UNUSABLE)); } break; case NODE_USABLE: if (unusableRMNodesConcurrentSet.contains(eventNode)) { LOG.debug(eventNode + reported usable); unusableRMNodesConcurrentSet.remove(eventNode); } for (RMApp app : rmContext.getRMApps().values()) { this.rmContext .getDispatcher() .getEventHandler() .handle( new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode, RMAppNodeUpdateType.NODE_USABLE)); } break; default: LOG.error(Ignoring invalid eventtype + event.getType()); } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3945: Attachment: YARN-3945.20150729-1.patch Oops My Mistake, corrected the patch to remove javac warnings... maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The the former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer. {quote} configuration related to minimum limit should not be made used in a formula to calculate max applications for a user -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3989) Show messages only for NodeLabel commands in RMAdminCLI
[ https://issues.apache.org/jira/browse/YARN-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645870#comment-14645870 ] Naganarasimha G R commented on YARN-3989: - Hi [~bibinchundatt] [~sunilg], is the purpose of this jira to remove the stack trace and show only the exception message? If so, I think we would need to handle this in multiple places for multiple commands, not just for RMAdminCLI and the NodeLabel commands. Also, it can sometimes become difficult for developers to look into an issue and resolve it if the stack trace is removed. I think we should have some flexible way: in operations a verbose trace is not wanted, but during development it is helpful. Thoughts? Show messages only for NodeLabel commands in RMAdminCLI Key: YARN-3989 URL: https://issues.apache.org/jira/browse/YARN-3989 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Minor Currently, for a nodelabel command execution failure, the full exception stack trace is shown. This jira is to handle exceptions and show only messages. As per the discussion in YARN-3963 [~sunilg] {quote} As I see it, I can see the full exception stack trace on the client side in this case (also in case of other commands too) and it's too verbose. I think we can make it compact and then it will be more easily readable. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
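A sketch of the kind of handling being discussed, with assumed names ({{runNodeLabelCommand}}, {{verbose}}) rather than the actual RMAdminCLI code: print only the exception message by default, and keep the full stack trace behind a verbose/debug switch so developers can still get it when needed.
{code}
try {
  runNodeLabelCommand(args);
} catch (Exception e) {
  if (verbose) {
    e.printStackTrace(System.err);        // full detail for development/debugging
  } else {
    System.err.println(e.getMessage());   // compact message for operators
  }
}
{code}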
[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645959#comment-14645959 ] Hadoop QA commented on YARN-3945: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 0s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 48s | The applied patch generated 1 new checkstyle issues (total was 92, now 91). | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 20s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 52m 13s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 90m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747768/YARN-3945.20150729-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 6374ee0 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8704/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8704/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8704/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8704/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8704/console | This message was automatically generated. maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The the former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. 
For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer. {quote} A configuration value that describes the minimum limit should not be used in a formula that calculates the maximum number of applications for a user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
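To make the miscalculation concrete, the following is a small, purely illustrative Java sketch (an editor's addition, not code from the patch; the class name and the configuration values are assumptions, using the 25% example from the quoted description and an assumed queue-wide maximum of 10000 applications):
{code}
public class MaxAppsPerUserExample {
  public static void main(String[] args) {
    // Assumed configuration values, not taken from the patch.
    int maxApplications = 10000;   // queue-wide maximum applications (assumed)
    int userLimit = 25;            // minimum-user-limit-percent, i.e. a MINIMUM guarantee
    float userLimitFactor = 1.0f;  // user-limit-factor

    // The formula quoted in the issue description.
    int maxApplicationsPerUser =
        (int) (maxApplications * (userLimit / 100.0f) * userLimitFactor);

    // Prints "maxApplicationsPerUser = 2500": the cap is derived from the
    // minimum guarantee, even though a lone user in the queue may hold up
    // to 100% of its resources according to the quoted description.
    System.out.println("maxApplicationsPerUser = " + maxApplicationsPerUser);
  }
}
{code}
Under these assumed values a single user is capped at 2500 applications even when nobody else has submitted to the queue, which is the mismatch between the formula and the documented user-limit semantics.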
[jira] [Commented] (YARN-3990) AsyncDispatcher may be overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645981#comment-14645981 ] Hadoop QA commented on YARN-3990: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 16m 5s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 48s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 18s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 52m 1s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 89m 55s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12747731/0001-YARN-3990.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 6374ee0 | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8705/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8705/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8705/console | This message was automatically generated. AsyncDispatcher may be overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3990.patch Whenever a node is added or removed, NodesListManager sends an RMAppNodeUpdateEvent to all the applications in the RMContext. But for finished/killed/failed applications, it is not required to send these events.
An additional check for whether the app is finished/killed/failed would minimize the unnecessary events:
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
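As an editor's illustration of the guard suggested above (not necessarily how the attached patch implements it; the helper name and the chosen set of terminal states are assumptions, and java.util.EnumSet must be imported), the per-app loops could skip applications that have already completed:
{code}
// Assumed set of terminal application states; the actual patch may check differently.
private static final EnumSet<RMAppState> COMPLETED_APP_STATES =
    EnumSet.of(RMAppState.FINISHED, RMAppState.FAILED, RMAppState.KILLED);

private boolean isApplicationCompleted(RMApp app) {
  return COMPLETED_APP_STATES.contains(app.getState());
}

// Inside handle(NodesListManagerEvent event), for both the NODE_UNUSABLE
// and NODE_USABLE branches shown above:
for (RMApp app : rmContext.getRMApps().values()) {
  if (isApplicationCompleted(app)) {
    continue; // no RMAppNodeUpdateEvent for finished/killed/failed applications
  }
  this.rmContext
      .getDispatcher()
      .getEventHandler()
      .handle(
          new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
              RMAppNodeUpdateType.NODE_USABLE));
}
{code}
Filtering before dispatch keeps terminal applications from flooding the AsyncDispatcher queue with events they will simply ignore.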
[jira] [Commented] (YARN-3945) maxApplicationsPerUser is wrongly calculated
[ https://issues.apache.org/jira/browse/YARN-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646048#comment-14646048 ] Naganarasimha G R commented on YARN-3945: - [~wangda] [~nroberts], The checkstyle report is not accurate, as the Eclipse code format template follows the coding guidelines wiki, and the whitespace is not exactly in the lines that were modified, but I can get both corrected along with the other review comments and doc updates. Can you please check the implementation and my comments on the doc so that I can modify it as required? maxApplicationsPerUser is wrongly calculated Key: YARN-3945 URL: https://issues.apache.org/jira/browse/YARN-3945 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.1 Reporter: Naganarasimha G R Assignee: Naganarasimha G R Attachments: YARN-3945.20150728-1.patch, YARN-3945.20150729-1.patch maxApplicationsPerUser is currently calculated based on the formula {{maxApplicationsPerUser = (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor)}} but description of userlimit is {quote} Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value.{color:red} The the former (the minimum value) is set to this property value {color} and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer. {quote} A configuration value that describes the minimum limit should not be used in a formula that calculates the maximum number of applications for a user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
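For illustration only (an editor's sketch, not the committed fix; the helper name is hypothetical), the gap the description points at can be seen by contrasting the configured minimum with the effective share a single user may actually hold, which depends on the number of active users:
{code}
// Purely illustrative arithmetic based on the user-limit description quoted above.
static float effectiveUserSharePercent(int minimumUserLimitPercent, int activeUsers) {
  // The effective limit is the larger of the configured minimum and an equal
  // split among the active users, capped at 100%.
  return Math.min(100f,
      Math.max(minimumUserLimitPercent, 100f / Math.max(1, activeUsers)));
}

// effectiveUserSharePercent(25, 1) == 100.0f  (a lone user may fill the queue)
// effectiveUserSharePercent(25, 2) ==  50.0f
// effectiveUserSharePercent(25, 3) ==  33.33f (approximately)
// effectiveUserSharePercent(25, 4) ==  25.0f  (floor at the configured minimum)
{code}
Deriving maxApplicationsPerUser from the 25% floor therefore undercounts what a user is allowed to run whenever fewer than four users are active, which is why the formula should not be built on the minimum limit alone.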