[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914888#comment-16914888 ]

Amithsha commented on YARN-9596:

Hi All, I am getting the following exception in 2.9.0. Is this related to the above issue?

2019-08-22 23:59:20,180 FATAL event.EventDispatcher (?:?(?)) - Error in handling event type NODE_UPDATE to the Event Dispatcher
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:301)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1346)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1341)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1430)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1205)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1067)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1472)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:151)
    at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
    at java.lang.Thread.run(Thread.java:745)

> QueueMetrics has incorrect metrics when labelled partitions are involved
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 2.8.0, 3.3.0
> Reporter: Muhammad Samir Khan
> Assignee: Muhammad Samir Khan
> Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 3.2.1, 2.9.3, 3.1.3
>
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch, YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default
> partition. However, the metrics are incorrect when labelled partitions are
> involved.
> Steps to reproduce
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896447#comment-16896447 ]

Hudson commented on YARN-9596:

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17009 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17009/])
YARN-9596: QueueMetrics has incorrect metrics when labelled partitions (ericp: rev 42683aef1a694af883c14842bf41f30b91e039f3)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestNodeLabelContainerAllocation.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java

> QueueMetrics has incorrect metrics when labelled partitions are involved
>
> Key: YARN-9596
> URL: https://issues.apache.org/jira/browse/YARN-9596
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 2.8.0, 3.3.0
> Reporter: Muhammad Samir Khan
> Assignee: Muhammad Samir Khan
> Priority: Major
> Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-2.8.005.patch, YARN-9596-branch-3.0.004.patch, YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default
> partition. However, the metrics are incorrect when labelled partitions are
> involved.
> Steps to reproduce
> ==
> # Configure capacity-scheduler.xml with label configuration
> # Add label "test" to cluster and replace label on node1 to be "test"
> # Note down "totalMB" at /ws/v1/cluster/metrics
> # Start first job on test queue.
> # Start second job on default queue (does not work if the order of two jobs is swapped).
> # While the two applications are running, the "totalMB" at /ws/v1/cluster/metrics will go down by the amount of MB used by the first job (screenshots attached).
> Alternately:
> In TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(), add the following lines at the end of the test, before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10*GB,
>     rootQueue.getMetrics().getAvailableMB() +
>     rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each, and only one of them has a non-default label. The test will also fail against a 20*GB check.

--
This message was sent by Atlassian JIRA (v7.6.14#76016)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
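The two checks in the description (10*GB fails, and 20*GB fails too) can be reconciled with a toy model. Assuming, as an interpretation not confirmed in this thread, that before the fix the root queue's available memory was summed over every partition while allocated memory only counted the default partition, then available + allocated comes out as 20 GB minus whatever the labelled partition is using. The sketch below is plain Java, not Hadoop's QueueMetrics; the class, record, and method names and the 3 GB / 4 GB usage figures are hypothetical.

```java
import java.util.Map;

/**
 * Toy model (not Hadoop code) of root-queue memory metrics with two
 * partitions. Assumption: before the fix, available MB was summed over all
 * partitions while allocated MB only counted the default partition.
 */
public class PartitionMetricsModel {

    /** Hypothetical per-partition figures in MB. */
    record Partition(long capacityMB, long usedMB) {}

    /** The default (unlabelled) partition is modelled by the empty label. */
    static final String DEFAULT_PARTITION = "";

    static long available(Partition p) {
        return Math.max(p.capacityMB() - p.usedMB(), 0L);
    }

    /** Assumed pre-fix behaviour: available summed over every partition. */
    static long buggyTotalMB(Map<String, Partition> partitions) {
        long availableAll = 0;
        for (Partition p : partitions.values()) {
            availableAll += available(p);
        }
        // Allocated only reflects the default partition.
        return availableAll + partitions.get(DEFAULT_PARTITION).usedMB();
    }

    /** Intended behaviour: both figures restricted to the default partition. */
    static long fixedTotalMB(Map<String, Partition> partitions) {
        Partition def = partitions.get(DEFAULT_PARTITION);
        return available(def) + def.usedMB();
    }

    public static void main(String[] args) {
        final long GB = 1024;
        // Two 10 GB nodes: one in the default partition, one labelled "test".
        Map<String, Partition> partitions = Map.of(
                DEFAULT_PARTITION, new Partition(10 * GB, 3 * GB),
                "test", new Partition(10 * GB, 4 * GB));
        System.out.println("buggy total: " + buggyTotalMB(partitions) / GB + " GB");
        System.out.println("fixed total: " + fixedTotalMB(partitions) / GB + " GB");
    }
}
```

Under this model the buggy total is 16 GB (20 GB minus the 4 GB used on the labelled node), which would fail both the 10*GB and the 20*GB assertions, while the default-partition-only total stays at 10 GB regardless of labelled usage.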
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896409#comment-16896409 ]

Eric Payne commented on YARN-9596:

Thanks for all your work on this, [~samkhan]. +1 to the backports as well.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893023#comment-16893023 ]

Muhammad Samir Khan commented on YARN-9596:

Created YARN-9702 for backporting YARN-5788 to branch-2.8.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892966#comment-16892966 ]

Muhammad Samir Khan commented on YARN-9596:

Also seeing the unit test failures and errors on branch-2.8 without the patch.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892380#comment-16892380 ]

Hadoop QA commented on YARN-9596:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 8m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-2.8 Compile Tests ||
| +1 | mvninstall | 8m 57s | branch-2.8 passed |
| +1 | compile | 0m 41s | branch-2.8 passed with JDK v1.7.0_95 |
| +1 | compile | 0m 33s | branch-2.8 passed with JDK v1.8.0_212 |
| +1 | checkstyle | 0m 18s | branch-2.8 passed |
| +1 | mvnsite | 0m 41s | branch-2.8 passed |
| +1 | findbugs | 1m 6s | branch-2.8 passed |
| +1 | javadoc | 0m 27s | branch-2.8 passed with JDK v1.7.0_95 |
| +1 | javadoc | 0m 20s | branch-2.8 passed with JDK v1.8.0_212 |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 32s | the patch passed |
| +1 | compile | 0m 36s | the patch passed with JDK v1.7.0_95 |
| +1 | javac | 0m 36s | the patch passed |
| +1 | compile | 0m 30s | the patch passed with JDK v1.8.0_212 |
| +1 | javac | 0m 30s | the patch passed |
| +1 | checkstyle | 0m 13s | the patch passed |
| +1 | mvnsite | 0m 35s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 1m 10s | the patch passed |
| +1 | javadoc | 0m 26s | the patch passed with JDK v1.7.0_95 |
| +1 | javadoc | 0m 19s | the patch passed with JDK v1.8.0_212 |
|| || || || Other Tests ||
| -1 | unit | 76m 48s | hadoop-yarn-server-resourcemanager in the patch failed. |
| -1 | asflicense | 0m 19s | The patch generated 1 ASF License warnings. |
| | | 104m 14s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| | hadoop.yarn.server.resourcemanager.TestAMAuthorization |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.0 Server=19.03.0 Image:yetus/hadoop:b93746a |
| JIRA Issue | YARN-9596 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12975723/YARN-9596-branch-2.8.005.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 7ee4468e751d 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-2.8 / c07b626 |
| maven | version: Apache Maven 3.3.9 |
| Default Java |
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892265#comment-16892265 ]

Muhammad Samir Khan commented on YARN-9596:

Posted a patch for 2.8. It also includes a workaround in the unit test for a race condition in AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375). For 2.8, we will also have to backport YARN-5788. Shall I post a patch here, or should that be tracked separately?
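The thread does not show the AsyncDispatcher workaround itself. One common way to make such tests robust against asynchronous event dispatch is to poll for the expected state with a timeout instead of asserting immediately. The following is a generic sketch of that pattern, not the code in the 2.8 patch; the class `PollingWait` and its `waitFor` method are hypothetical names.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: poll a condition until it holds or a timeout expires. */
public final class PollingWait {

    private PollingWait() {}

    /**
     * Returns true as soon as {@code condition} is true, or false if it is
     * still false once {@code timeoutMs} has elapsed. The condition is
     * re-checked every {@code intervalMs}.
     */
    public static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status for the caller
                return condition.getAsBoolean();
            }
        }
    }
}
```

In a test, the assertion on queue metrics would then be preceded by something like `assertTrue(PollingWait.waitFor(() -> rootQueue.getMetrics().getAllocatedMB() == expectedMB, 50, 5000))`, where `expectedMB` is a placeholder, so the check runs only after the dispatcher has delivered the pending events.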
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892226#comment-16892226 ]

Eric Payne commented on YARN-9596:

bq. The unit test failures are also happening in branch-3.0.

Yes, I see that now. I will continue to review the 3.0 patch.

Unfortunately, we will also need a branch-2.8 patch. This patch does not backport or apply cleanly to branch-2.8.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892191#comment-16892191 ]

Muhammad Samir Khan commented on YARN-9596:

The remaining two unit test failures in TestNodeLabelContainerAllocation should have been fixed by the YARN-7466 addendum patch, but they still seem to be broken in branch-3.0.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892122#comment-16892122 ]

Muhammad Samir Khan commented on YARN-9596:

YARN-4901 fixes some of the unit test failures but it is not in branch-3.0.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892081#comment-16892081 ]

Muhammad Samir Khan commented on YARN-9596:

The findbugs warnings are from branch-3.0 (pre-patch). The unit test failures are also happening in branch-3.0. They just happen a little later since the assert statement is later in branch-3.0. Some of the tests fail if I run all tests in TestNodeLabelContainerAllocation but not if I run the specific tests by themselves.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892065#comment-16892065 ]

Eric Payne commented on YARN-9596:

Thanks, [~samkhan], for the 3.0 patch. The test failures for {{TestOpportunisticContainerAllocatorAMService}} seem to be happening in 3.0 without this patch. However, the failures for {{TestNodeLabelContainerAllocation}} do seem to be caused by the 3.0 patch. I'm concerned about the findbugs warnings, but I am not sure why this patch would have caused them.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891986#comment-16891986 ] Eric Payne commented on YARN-9596: -- I'd like to document why a branch-3.0 patch was necessary. In trunk and 3.2, {{CSQueueUtils.java#getMaxAvailableResourceToQueue}} calculated {{totalAvailableResource}} as follows:
{code:title=Trunk version of CSQueueUtils.java#getMaxAvailableResourceToQueue}
Resource totalAvailableResource = Resources.createResource(0, 0);
{code}
So the new {{getMaxAvailableResourceToQueuePartition}} method performs the calculation the same way. However, the 3.0 backport cannot calculate {{totalAvailableResource}} the same way, because the 3.0 version of the method is different:
{code:title=3.0 version of CSQueueUtils.java#getMaxAvailableResourceToQueue}
Resource queueGuranteedResource = Resources.multiply(nlm
    .getResourceByLabel(partition, cluster), queue.getQueueCapacities()
    .getAbsoluteCapacity(partition));
{code}
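To make the 3.0 snippet concrete, here is a memory-only sketch of a per-partition availability calculation. The guaranteed amount follows the 3.0 code shown above (the partition's resource multiplied by the queue's absolute capacity for that partition); treating "available" as that guarantee minus current usage, floored at zero, is an assumption made for illustration. The real code operates on Resource objects through Resources and a ResourceCalculator, and the class name here is hypothetical.

```java
// Hypothetical, memory-only sketch; a resource is modelled as a long (MB)
// instead of a Resource object.
final class PartitionAvailabilitySketch {

  // 3.0-style guarantee: partition total multiplied by the queue's
  // absolute capacity for that partition (truncated to whole MB).
  static long guaranteedMB(long partitionTotalMB, float absoluteCapacity) {
    return (long) (partitionTotalMB * (double) absoluteCapacity);
  }

  // Assumed rule: what remains of the guarantee after subtracting
  // current usage in that partition, never negative.
  static long maxAvailableMB(long partitionTotalMB, float absoluteCapacity,
      long usedMB) {
    long guaranteed = guaranteedMB(partitionTotalMB, absoluteCapacity);
    return Math.max(guaranteed - usedMB, 0L);
  }
}
```

For instance, a queue with 50% absolute capacity on a 10240 MB partition is guaranteed 5120 MB; with 2048 MB in use, the sketch reports 3072 MB available, and with 8192 MB in use it reports 0.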
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891957#comment-16891957 ] Muhammad Samir Khan commented on YARN-9596: --- Looking at the unit test failures.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891557#comment-16891557 ] Hadoop QA commented on YARN-9596: -
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 11m 28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} branch-3.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 17s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 21s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 23s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in branch-3.0 has 2 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} branch-3.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 19m 38s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 22s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}139m 57s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation |
| | hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.0 Server=19.03.0 Image:yetus/hadoop:e402791 |
| JIRA Issue | YARN-9596 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12975566/YARN-9596-branch-3.0.004.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 1b527b339444 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.0 / 6aa76ea |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| findbugs | https://builds.apache.org/job/PreCommit-YARN-Build/24419/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html |
| unit |
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891432#comment-16891432 ] Muhammad Samir Khan commented on YARN-9596: ---
{quote}[~eepayne] yes, the patch applies cleanly with the --3way option on git apply. For branch-2.8 though the unit test fails because of a race condition in AsyncDispatcher (see YARN-3878, YARN-5436, and YARN-5375) {quote}
Because of whitespace changes between patch 002 and patch 003, the latest patch no longer applies cleanly to branch-3.0 and earlier versions. I uploaded a separate patch for that.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891186#comment-16891186 ] Manikandan R commented on YARN-9596: No more comments, [~eepayne]. Thanks.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891165#comment-16891165 ] Eric Payne commented on YARN-9596: -- Okay. Thanks a lot, [~samkhan], for the good work reporting and fixing this issue, and [~maniraj...@gmail.com] for the helpful reviews. I give my +1. [~maniraj...@gmail.com], do you have any further comments before I commit this?
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890284#comment-16890284 ] Muhammad Samir Khan commented on YARN-9596: --- {{CSQueueUtils#updateUsedCapacity}} is called before {{getMaxAvailableResourceToQueuePartition}}, so any check for the correct partition should go in {{CSQueueUtils#updateQueueStatistics}}, where it covers both methods.
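One way to read this proposal is sketched below: the partition check moves up into updateQueueStatistics, so a single guard covers updateUsedCapacity (which runs first) as well as getMaxAvailableResourceToQueuePartition. The method names mirror the ones in the comment, but the class, signatures, and bodies are simplified, hypothetical stand-ins that only record which steps ran.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the real CSQueueUtils code.
final class QueueStatisticsSketch {
  final Set<String> queueLabels;                // stand-in for getNodeLabelsForQueue()
  final List<String> calls = new ArrayList<>(); // records which steps actually ran

  QueueStatisticsSketch(Set<String> queueLabels) {
    this.queueLabels = queueLabels;
  }

  // The partition check lives here, so it guards both steps below,
  // including updateUsedCapacity, which is called first.
  void updateQueueStatistics(String partition) {
    if (!queueLabels.contains(partition)) {
      return;
    }
    updateUsedCapacity(partition);
    getMaxAvailableResourceToQueuePartition(partition);
  }

  void updateUsedCapacity(String partition) {
    calls.add("updateUsedCapacity:" + partition);
  }

  long getMaxAvailableResourceToQueuePartition(String partition) {
    calls.add("getMaxAvailable:" + partition);
    return 0L; // placeholder; the real method computes a Resource
  }
}
```

With a single guard at the entry point, neither helper needs its own copy of the check.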
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889407#comment-16889407 ] Manikandan R commented on YARN-9596: IIUC, this case would arise in the following example: Node N is mapped to label X (non-exclusive). Queue A is configured with the ANY node label. App A requests resources from queue A, and for some reason its containers run on node N. During the {{AbstractCSQueue#allocateResource}} call, the node's partition (obtained from {{SchedulerNode}}) is used in the calculation. As you explained, the code will calculate {{available}} to {{}}, and that value goes unused anyway, since {{QueueMetrics#setAvailableResourcesToQueue}} currently processes only the "default" partition. But with YARN-6492 coming in, I think we will need to revisit this particular {{if}} check, because we will need {{available}} for all used partitions. I guess that will be a bit tricky: {{Queue#getNodeLabelsForQueue}} contains only the labels for which some minimum and maximum resources have been configured, yet in the example above, metrics should also be computed properly for partition X. Can you please validate my understanding?
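The point that the computed value currently goes unused can be sketched as a metrics holder that only accepts updates for the default partition. This is a hypothetical stand-in, not the real QueueMetrics class; the method name mirrors QueueMetrics#setAvailableResourcesToQueue, while representing the default partition as "" and resources as plain megabytes are assumptions.

```java
// Hypothetical sketch of "metrics track only the default partition".
final class DefaultOnlyMetricsSketch {
  static final String DEFAULT_PARTITION = ""; // assumed default-partition label
  private long availableMB;

  void setAvailableResourcesToQueue(String partition, long availableMB) {
    if (!DEFAULT_PARTITION.equals(partition)) {
      return; // non-default partitions (e.g. "X") are not tracked
    }
    this.availableMB = availableMB;
  }

  long getAvailableMB() {
    return availableMB;
  }
}
```

In the node N / label X example, an update for partition "X" is simply dropped, so the default-partition value is unaffected. Tracking every used partition, as anticipated with YARN-6492, would presumably replace the single field with per-partition state.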
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889183#comment-16889183 ] Eric Payne commented on YARN-9596: --
{quote}2. {code:java} if (queue.getNodeLabelsForQueue().contains(partition)){code} Is it required in CSQueueUtils#getMaxAvailableResourceToQueuePartition? {quote}
It is true that if {{partition}} does not exist as a label in {{queue}}, the rest of the code will calculate {{available}} to {{}}. However, {{getMaxAvailableResourceToQueuePartition}} is called from {{updateQueueStatistics}}, which is called from both {{AbstractCSQueue#allocateResource}} and {{AbstractCSQueue#releaseResource}}. The question is whether {{getMaxAvailableResourceToQueuePartition}} is ever called when {{partition}} is not valid for that queue. If it is, dropping the check would mean a lot of extra {{Resources}} calculations every time a container is allocated or released. Are we sure that {{partition}} will always be correct?
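The question above is about cost rather than correctness. The sketch below assumes, for illustration, that a partition the queue has no configured capacity for behaves as if its absolute capacity were 0, so the result is zero either way; the guard then only decides whether the arithmetic runs on each allocate or release. PartitionGuardSketch and its counter are hypothetical, and resources are modelled as plain megabytes.

```java
import java.util.Map;

// Hypothetical, memory-only sketch; not the real CSQueueUtils code.
final class PartitionGuardSketch {
  final Map<String, Float> absoluteCapacityByLabel; // configured labels only
  int calculations; // how many times the arithmetic actually ran

  PartitionGuardSketch(Map<String, Float> absoluteCapacityByLabel) {
    this.absoluteCapacityByLabel = absoluteCapacityByLabel;
  }

  long maxAvailableMB(String partition, long partitionTotalMB, long usedMB,
      boolean withGuard) {
    if (withGuard && !absoluteCapacityByLabel.containsKey(partition)) {
      return 0L; // skip the calculation entirely
    }
    calculations++;
    // Unconfigured partitions fall back to an absolute capacity of 0.
    float absCap = absoluteCapacityByLabel.getOrDefault(partition, 0f);
    long guaranteed = (long) (partitionTotalMB * (double) absCap);
    return Math.max(guaranteed - usedMB, 0L);
  }
}
```

Both paths return 0 for an unconfigured partition; only the guarded path avoids incrementing the counter, which stands in for the per-container Resources work.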
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887441#comment-16887441 ] Hadoop QA commented on YARN-9596: -
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 47s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 12s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 79m 48s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}136m 48s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.8 Server=18.09.8 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9596 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12975070/YARN-9596.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux b6e75e2518b2 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 303a7f8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24406/testReport/ |
| Max. process+thread count | 865 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24406/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887198#comment-16887198 ] Muhammad Samir Khan commented on YARN-9596: --- Updated with changes.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884569#comment-16884569 ] Manikandan R commented on YARN-9596:

Minor nits:
# CSQueueUtils#getMaxAvailableResourceToQueue can be removed.
# {code:java}if (queue.getNodeLabelsForQueue().contains(partition)){code} Is this check required in CSQueueUtils#getMaxAvailableResourceToQueuePartition?
# {{nodePartition}} can be used instead of the NO_LABEL constant inside CSQueueUtils#updateUsedCapacity to update usedCapacity and absoluteUsedCapacity (though this is not related to this jira).
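One way to reason about the second nit is a minimal Java sketch comparing a guarded and an unguarded computation. It rests on an assumption not confirmed in this thread: that a queue's absolute capacity for a partition it cannot access is zero. The class PartitionGuardSketch and its methods are hypothetical and are not the CSQueueUtils API. Under that assumption both variants return the same value, so the accessibility check would only matter for a configuration that grants capacity in an inaccessible partition.

```java
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical toy model, not Hadoop's CSQueueUtils: the maximum memory a queue
 * may use in a partition is its absolute capacity there times the partition's
 * total memory. Assumes an inaccessible partition has no configured capacity,
 * so the lookup falls back to 0.0.
 */
public class PartitionGuardSketch {

    /** Variant with an explicit accessibility check before computing. */
    static long withGuard(Set<String> accessibleLabels, Map<String, Double> absCapacity,
                          String partition, long partitionTotalMB) {
        if (!accessibleLabels.contains(partition)) {
            return 0L;
        }
        return withoutGuard(absCapacity, partition, partitionTotalMB);
    }

    /** Variant relying only on the (assumed zero) capacity of inaccessible partitions. */
    static long withoutGuard(Map<String, Double> absCapacity,
                             String partition, long partitionTotalMB) {
        return Math.round(absCapacity.getOrDefault(partition, 0.0) * partitionTotalMB);
    }

    public static void main(String[] args) {
        Set<String> accessible = Set.of("", "test");
        Map<String, Double> capacity = Map.of("", 0.5, "test", 1.0);
        for (String p : new String[] {"", "test", "other"}) {
            // Both variants agree for these inputs, which satisfy the stated assumption.
            String name = p.isEmpty() ? "<default>" : p;
            System.out.println(name + ": " + withGuard(accessible, capacity, p, 10240)
                + " vs " + withoutGuard(capacity, p, 10240));
        }
    }
}
```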
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868693#comment-16868693 ] Muhammad Samir Khan commented on YARN-9596:
---
[~eepayne] yes, the patch applies cleanly with the --3way option of git apply. For branch-2.8, though, the unit test fails because of a race condition in AsyncDispatcher (see [YARN-3878|https://issues.apache.org/jira/browse/YARN-3878], [YARN-5436|https://issues.apache.org/jira/browse/YARN-5436], and [YARN-5375|https://issues.apache.org/jira/browse/YARN-5375]).
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868264#comment-16868264 ] Manikandan R commented on YARN-9596: [~samkhan] [~eepayne] Sorry for the delay. I have been trying to understand this issue in relation to YARN-9088 as well as from the perspective of other metrics. I will post an update as soon as possible. In the meantime, can you also take a look at those?
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868027#comment-16868027 ] Eric Payne commented on YARN-9596: -- [~samkhan], these changes look good to me for trunk. Do they backport to branch-2?
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858834#comment-16858834 ] Muhammad Samir Khan commented on YARN-9596: --- [~Naganarasimha] [~maniraj...@gmail.com] this is related to YARN-6467. Can you please take a look? Thanks.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857033#comment-16857033 ] Hadoop QA commented on YARN-9596:
-

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 23m 36s | trunk passed |
| +1 | compile | 0m 47s | trunk passed |
| +1 | checkstyle | 0m 39s | trunk passed |
| +1 | mvnsite | 0m 51s | trunk passed |
| +1 | shadedclient | 12m 16s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 18s | trunk passed |
| +1 | javadoc | 0m 31s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 43s | the patch passed |
| +1 | compile | 0m 42s | the patch passed |
| +1 | javac | 0m 41s | the patch passed |
| +1 | checkstyle | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 44s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 47s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 22s | the patch passed |
| +1 | javadoc | 0m 27s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 83m 51s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
| | | 140m 41s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9596 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12970965/YARN-9596.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ec458f827157 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0b1e288 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24238/testReport/ |
| Max. process+thread count | 872 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24238/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved
[ https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855115#comment-16855115 ] Hadoop QA commented on YARN-9596:
-

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 17m 25s | trunk passed |
| +1 | compile | 0m 43s | trunk passed |
| +1 | checkstyle | 0m 30s | trunk passed |
| +1 | mvnsite | 0m 44s | trunk passed |
| +1 | shadedclient | 11m 5s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 17s | trunk passed |
| +1 | javadoc | 0m 30s | trunk passed |
|| || || || Patch Compile Tests ||
| -1 | mvninstall | 0m 37s | hadoop-yarn-server-resourcemanager in the patch failed. |
| -1 | compile | 0m 37s | hadoop-yarn-server-resourcemanager in the patch failed. |
| -1 | javac | 0m 37s | hadoop-yarn-server-resourcemanager in the patch failed. |
| -0 | checkstyle | 0m 25s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 22 new + 110 unchanged - 0 fixed = 132 total (was 110) |
| -1 | mvnsite | 0m 37s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| -1 | shadedclient | 3m 7s | patch has errors when building and testing our client artifacts. |
| -1 | findbugs | 0m 29s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | javadoc | 0m 24s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 0m 38s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 26s | The patch does not generate ASF License warnings. |
| | | 39m 35s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9596 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12970748/YARN-9596.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 2a8c4ac1e613 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 277e9a8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| mvninstall | https://builds.apache.org/job/PreCommit-YARN-Build/24212/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt |
| compile |