[
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sibyl.lv updated YARN-10517:
----------------------------
Description:
After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still reports
incorrect allocated JMX values, such as {color:#660e7a}allocatedMB{color},
{color:#660e7a}allocatedVCores{color} and
{color:#660e7a}allocatedContainers{color}, when a node's partition is
updated from "DEFAULT" to another label while applications are running.
Steps to reproduce
==============
# Configure capacity-scheduler.xml with node label settings
# Submit an application to the default partition and let it run
# Add the label "tpcds" to the cluster and, while the application is still
running, replace the label on node1 and node2 with "tpcds"
# Note the "VCores Used" value in the Web UI
# When the application finishes, the metrics are wrong (screenshots
attached).
==============
FiCaSchedulerApp doesn't update the queue metrics when CapacityScheduler handles
the {color:#660e7a}NODE_LABELS_UPDATE{color} event.
So the queue metrics should release the container's resource from the old
partition and add it as used resource to the new partition, just as is done
when updating queueUsage:
{code:java}
public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
    String newPartition) {
  Resource containerResource = rmContainer.getAllocatedResource();
  // Move the container's resource from the old partition to the new one,
  // for both the application attempt and its leaf queue
  this.attemptResourceUsage.decUsed(oldPartition, containerResource);
  this.attemptResourceUsage.incUsed(newPartition, containerResource);
  getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
  getCSLeafQueue().incUsedResource(newPartition, containerResource, this);

  // Update new partition name if container is AM and also update AM resource
  if (rmContainer.isAMContainer()) {
    setAppAMNodePartitionName(newPartition);
    this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
    this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
    getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
    getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
  }
}
{code}
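To illustrate why the metrics drift, here is a minimal, self-contained sketch. It is not YARN code: PartitionMetricsSketch and its methods are hypothetical stand-ins for per-partition queue metrics, and the sketch assumes that a finished container is released against the partition its node has at completion time. Under that assumption, skipping the move on a label update leaves the old partition over-counted and the new one negative.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical, simplified model of per-partition allocation metrics.
 * None of these names are real YARN classes or methods.
 */
class PartitionMetricsSketch {
    // Allocated vcores per partition; "" stands in for the default partition.
    private final Map<String, Integer> allocatedVCores = new HashMap<>();

    void allocate(String partition, int vcores) {
        allocatedVCores.merge(partition, vcores, Integer::sum);
    }

    void release(String partition, int vcores) {
        allocatedVCores.merge(partition, -vcores, Integer::sum);
    }

    /** Moves a running container's allocation from one partition to another. */
    void movePartition(String oldPartition, String newPartition, int vcores) {
        release(oldPartition, vcores);
        allocate(newPartition, vcores);
    }

    int allocated(String partition) {
        return allocatedVCores.getOrDefault(partition, 0);
    }

    public static void main(String[] args) {
        // Case 1: the node is relabelled but the metrics are not moved.
        PartitionMetricsSketch stale = new PartitionMetricsSketch();
        stale.allocate("", 4);      // container starts on the default partition
        stale.release("tpcds", 4);  // assumed: release is charged to the new label
        System.out.println(stale.allocated("") + " " + stale.allocated("tpcds"));
        // prints "4 -4"

        // Case 2: the metrics are moved when the label changes.
        PartitionMetricsSketch moved = new PartitionMetricsSketch();
        moved.allocate("", 4);
        moved.movePartition("", "tpcds", 4);
        moved.release("tpcds", 4);
        System.out.println(moved.allocated("") + " " + moved.allocated("tpcds"));
        // prints "0 0"
    }
}
```

The second case mirrors the dec/inc pairing that nodePartitionUpdated already applies to queueUsage. How the real QueueMetrics calls should be wired is left to the attached patch.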
> QueueMetrics has incorrect Allocated Resource when labelled partitions updated
> ------------------------------------------------------------------------------
>
> Key: YARN-10517
> URL: https://issues.apache.org/jira/browse/YARN-10517
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.0, 3.3.0
> Reporter: sibyl.lv
> Priority: Major
> Fix For: 3.3.1, 3.2.3
>
> Attachments: YARN-10517-branch-3.2.001.patch, wrong metrics.png
>