[jira] [Commented] (YARN-10517) QueueMetrics has incorrect Allocated Resource when labelled partitions updated

Fencheng Mei (Jira) Wed, 13 Dec 2023 23:27:06 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796595#comment-17796595
 ]


Fencheng Mei commented on YARN-10517:
-------------------------------------

I hit the same issue and was able to reproduce it. After applying Qi Zhu's 
patch, the same steps no longer reproduce it. Steps to reproduce (on 
version 3.3.2):
 # Apply the 'Test' label to machines A and B, then submit a Spark job to the 
corresponding queue.
 # While the Spark job is running, remove the label from machine A.
 # Let the Spark job run to completion; it finishes without issues.
 # In any monitoring system, the "AllocatedMB" metric for the queue the Spark 
job was submitted to is now incorrect.
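
To make the steps above easier to reason about, here is a toy model of the drift. It is not YARN code: the class and method names are hypothetical. It rests on two simplifying assumptions: the queue-level allocated counter only tracks containers in the default partition, and each allocate/release is attributed to whatever partition the node has at that moment.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; none of these names exist in YARN.
class LabelDriftSketch {
    static final String DEFAULT_PARTITION = "";

    // Node -> current partition (label). Assumption: one label per node.
    private final Map<String, String> nodePartition = new HashMap<>();
    // Assumption: the queue-level counter tracks default-partition containers only.
    private long allocatedMb = 0;

    void setLabel(String node, String partition) {
        nodePartition.put(node, partition);
    }

    private String partitionOf(String node) {
        return nodePartition.getOrDefault(node, DEFAULT_PARTITION);
    }

    void allocate(String node, long mb) {
        if (partitionOf(node).equals(DEFAULT_PARTITION)) {
            allocatedMb += mb;
        }
    }

    void release(String node, long mb) {
        // Uses the node's partition at release time, not at allocation time.
        if (partitionOf(node).equals(DEFAULT_PARTITION)) {
            allocatedMb -= mb;
        }
    }

    long allocatedMb() {
        return allocatedMb;
    }

    public static void main(String[] args) {
        LabelDriftSketch queue = new LabelDriftSketch();
        queue.setLabel("A", "Test");            // step 1: label both machines
        queue.setLabel("B", "Test");
        queue.allocate("A", 2048);              // job gets containers on A and B
        queue.allocate("B", 2048);
        queue.setLabel("A", DEFAULT_PARTITION); // step 2: label removed from A
        queue.release("A", 2048);               // step 3: job completes
        queue.release("B", 2048);
        // Step 4: the counter should be 0 but is not.
        System.out.println("allocatedMb after job = " + queue.allocatedMb());
    }
}
```

In this model the container on A is never counted on allocation but is subtracted on release, so the counter ends below zero once the job finishes.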

> QueueMetrics has incorrect Allocated Resource when labelled partitions updated
> ------------------------------------------------------------------------------
>
>                 Key: YARN-10517
>                 URL: https://issues.apache.org/jira/browse/YARN-10517
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0, 3.3.0
>            Reporter: sibyl.lv
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10517-branch-3.2.001.patch, YARN-10517.001.patch, 
> wrong metrics.png
>
>
> After https://issues.apache.org/jira/browse/YARN-9596, QueueMetrics still 
> reports incorrect allocated values over JMX, such as allocatedMB, 
> allocatedVCores and allocatedContainers, when a node's partition is updated 
> from "DEFAULT" to another label while applications are running.
> Steps to reproduce
> ==============
>  # Configure capacity-scheduler.xml with label configuration
>  # Submit one application to the default partition and let it run
>  # Add label "tpcds" to the cluster and replace the label on node1 and node2 
> with "tpcds" while the above application is running
>  # Note down "VCores Used" in the Web UI
>  # When the application finishes, the metrics are wrong (screenshots 
> attached).
> ==============
>  
> FiCaSchedulerApp doesn't update queue metrics when CapacityScheduler handles 
> the NODE_LABELS_UPDATE event.
> So we should release the container resource from the old partition and add 
> the used resource to the new partition, just as is done for queueUsage.
> {code:java}
> public void nodePartitionUpdated(RMContainer rmContainer, String oldPartition,
>     String newPartition) {
>   Resource containerResource = rmContainer.getAllocatedResource();
>   this.attemptResourceUsage.decUsed(oldPartition, containerResource);
>   this.attemptResourceUsage.incUsed(newPartition, containerResource);
>   getCSLeafQueue().decUsedResource(oldPartition, containerResource, this);
>   getCSLeafQueue().incUsedResource(newPartition, containerResource, this);
>   // Update new partition name if container is AM and also update AM resource
>   if (rmContainer.isAMContainer()) {
>     setAppAMNodePartitionName(newPartition);
>     this.attemptResourceUsage.decAMUsed(oldPartition, containerResource);
>     this.attemptResourceUsage.incAMUsed(newPartition, containerResource);
>     getCSLeafQueue().decAMUsedResource(oldPartition, containerResource, this);
>     getCSLeafQueue().incAMUsedResource(newPartition, containerResource, this);
>   }
> }
> {code}
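
The idea in the description above (release from the old partition, add to the new one) can be sketched with a toy per-partition counter. This is not the attached patch and does not use YARN classes; `PartitionMoveSketch` and all of its methods are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy per-partition allocated-memory counter; hypothetical names, not YARN APIs.
class PartitionMoveSketch {
    private final Map<String, Long> allocatedMbByPartition = new HashMap<>();

    void allocate(String partition, long mb) {
        allocatedMbByPartition.merge(partition, mb, Long::sum);
    }

    void release(String partition, long mb) {
        allocatedMbByPartition.merge(partition, -mb, Long::sum);
    }

    // Same shape as the usage update in nodePartitionUpdated:
    // decrement on the old partition, increment on the new one.
    void nodePartitionUpdated(String oldPartition, String newPartition, long mb) {
        release(oldPartition, mb);
        allocate(newPartition, mb);
    }

    long allocatedMb(String partition) {
        return allocatedMbByPartition.getOrDefault(partition, 0L);
    }

    public static void main(String[] args) {
        PartitionMoveSketch metrics = new PartitionMoveSketch();
        metrics.allocate("", 4096);                      // container starts in the default partition
        metrics.nodePartitionUpdated("", "tpcds", 4096); // node relabelled while the app runs
        metrics.release("tpcds", 4096);                  // container finishes on the relabelled node
        System.out.println("default=" + metrics.allocatedMb("")
            + " tpcds=" + metrics.allocatedMb("tpcds"));
    }
}
```

With the move applied at relabel time, both partitions return to zero after the container is released. Skipping the `nodePartitionUpdated` call in this sketch leaves the default partition stuck at 4096 and the new partition at -4096.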



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
