[jira] [Updated] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated

sibyl.lv (Jira) Fri, 04 Dec 2020 22:19:08 -0800


     [ 
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sibyl.lv updated YARN-10517:
----------------------------
    Description: 
After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still 
reports incorrect allocated values over JMX, such as allocatedMB, 
allocatedVCores and allocatedContainers, when a node's partition is 
updated from "DEFAULT" to another label while applications are running.

Steps to reproduce

==============
 # Configure capacity-scheduler.xml with node label configuration
 # Submit one application to the default partition and let it run
 # While the application is running, add the label "tpcds" to the cluster and 
replace the labels on node1 and node2 with "tpcds"
 # Note down "VCores Used" in the Web UI
 # When the application finishes, the metrics are wrong (screenshots 
attached).

==============

 

FiCaSchedulerApp doesn't update queue metrics when CapacityScheduler handles 
the NODE_LABELS_UPDATE event.

So we should release the container's resource from the old partition and add 
it as used resource to the new partition, just as is done when updating 
queueUsage.
{code:java}
public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
    String newPartition) {
  Resource containerResource = rmContainer.getAllocatedResource();
  this.attemptResourceUsage.decUsed(oldPartition, containerResource);
  this.attemptResourceUsage.incUsed(newPartition, containerResource);
  getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
  getCSLeafQueue().incUsedResource(newPartition, containerResource, this);

  // Update new partition name if container is AM and also update AM resource
  if (rmContainer.isAMContainer()) {
    setAppAMNodePartitionName(newPartition);
    this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
    this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
    getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
    getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
  }
}
{code}
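
To make the proposal concrete, here is a minimal, self-contained sketch of moving a running container's allocated counters from the old partition to the new one. It is a toy model under stated assumptions, not the Hadoop QueueMetrics API and not the attached patch: the class PartitionMetricsSketch and its methods allocate, release and nodePartitionUpdated are hypothetical names introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-partition allocated metrics (hypothetical, not Hadoop's QueueMetrics). */
public class PartitionMetricsSketch {
  // partition name -> {allocatedMB, allocatedVCores, allocatedContainers}
  private final Map<String, long[]> allocated = new HashMap<>();

  private long[] slot(String partition) {
    return allocated.computeIfAbsent(partition, p -> new long[3]);
  }

  /** Record one container allocated in the given partition. */
  public void allocate(String partition, long mb, long vcores) {
    long[] s = slot(partition);
    s[0] += mb;
    s[1] += vcores;
    s[2] += 1;
  }

  /** Record one container released from the given partition. */
  public void release(String partition, long mb, long vcores) {
    long[] s = slot(partition);
    s[0] -= mb;
    s[1] -= vcores;
    s[2] -= 1;
  }

  /** Move one running container's allocation from the old partition to the new one. */
  public void nodePartitionUpdated(String oldPartition, String newPartition,
      long mb, long vcores) {
    release(oldPartition, mb, vcores);
    allocate(newPartition, mb, vcores);
  }

  public long allocatedMB(String partition) { return slot(partition)[0]; }
  public long allocatedVCores(String partition) { return slot(partition)[1]; }
  public long allocatedContainers(String partition) { return slot(partition)[2]; }

  public static void main(String[] args) {
    PartitionMetricsSketch m = new PartitionMetricsSketch();
    m.allocate("DEFAULT", 1024, 1);                      // container starts in the default partition
    m.nodePartitionUpdated("DEFAULT", "tpcds", 1024, 1); // node relabelled while the container runs
    m.release("tpcds", 1024, 1);                         // container finishes in the new partition
    // Both partitions end at zero allocated vcores.
    System.out.println(m.allocatedVCores("DEFAULT") + " " + m.allocatedVCores("tpcds"));
  }
}
```

In this toy model, skipping the nodePartitionUpdated call leaves the "DEFAULT" counters at 1024 MB / 1 vcore and drives the "tpcds" counters negative once the container is released, which is one way a mismatch of the kind described above could arise.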

  was:
After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still 
reports incorrect allocated values over JMX, such as allocatedMB, 
allocatedVCores and allocatedContainers, when a node's partition is 
updated from "DEFAULT" to another label while applications are running.

Steps to reproduce

==============
 # Configure capacity-scheduler.xml with node label configuration
 # Submit one application to the default partition and let it run
 # While the application is running, add the label "tpcds" to the cluster and 
replace the labels on node1 and node2 with "tpcds"
 # Note down "VCores Used" in the Web UI
 # When the application finishes, the metrics are wrong (screenshots 
attached).


> QueueMetrics has incorrect Allocated Resource when labelled partitions updated
> ------------------------------------------------------------------------------
>
>                 Key: YARN-10517
>                 URL: https://issues.apache.org/jira/browse/YARN-10517
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0, 3.3.0
>            Reporter: sibyl.lv
>            Priority: Major
>             Fix For: 3.3.1, 3.2.3
>
>         Attachments: YARN-10517-branch-3.2.001.patch, wrong metrics.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
