[jira] [Commented] (YARN-9596) QueueMetrics has incorrect metrics when labelled partitions are involved

Eric Payne (JIRA) Wed, 24 Jul 2019 09:35:25 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891986#comment-16891986
 ]


Eric Payne commented on YARN-9596:
----------------------------------

I'd like to document why a branch-3.0 patch was necessary.

In trunk and 3.2, {{CSQueueUtils.java#getMaxAvailableResourceToQueue}} 
calculated {{totalAvailableResource}} as follows:
{code:title=Trunk version of CSQueueUtils.java#getMaxAvailableResourceToQueue}
    Resource totalAvailableResource = Resources.createResource(0, 0);
{code}
So, the new {{getMaxAvailableResourceToQueuePartition}} method calculates it the 
same way.

However, when backporting to 3.0, {{totalAvailableResource}} cannot be 
calculated the same way, because the corresponding code is different in 3.0:
{code:title=3.0 version of CSQueueUtils.java#getMaxAvailableResourceToQueue}
    Resource queueGuranteedResource = Resources.multiply(nlm
        .getResourceByLabel(partition, cluster), queue.getQueueCapacities()
        .getAbsoluteCapacity(partition));
{code}
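
As a rough illustration of what the 3.0 snippet above computes (the partition's total resource scaled by the queue's absolute capacity in that partition), here is a minimal sketch. It uses plain longs instead of Hadoop's {{Resource}} type and models memory only; the class name, method name, and sample values are hypothetical and not part of Hadoop, and the truncating cast is an assumption that may not match Hadoop's own rounding.

```java
// Hypothetical, simplified stand-in for the 3.0 calculation quoted above.
// Plain longs replace Hadoop's Resource type; none of these names are Hadoop APIs.
public class GuaranteedResourceSketch {

    /** Memory guaranteed to a queue within one partition, in MB. */
    static long guaranteedMB(long partitionTotalMB, float absoluteCapacity) {
        // Scale the partition's total by the queue's absolute capacity.
        // The cast truncates toward zero (an assumption of this sketch).
        return (long) (partitionTotalMB * (double) absoluteCapacity);
    }

    public static void main(String[] args) {
        long partitionTotalMB = 10 * 1024; // assumed: a 10 GB partition
        float absoluteCapacity = 0.5f;     // assumed: queue owns 50% of it
        System.out.println(guaranteedMB(partitionTotalMB, absoluteCapacity));
    }
}
```

The point of the sketch is only that the 3.0 value depends on both the partition's resources and the queue's per-partition capacity, so a backport has to start from that expression rather than from the trunk one.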

> QueueMetrics has incorrect metrics when labelled partitions are involved
> ------------------------------------------------------------------------
>
>                 Key: YARN-9596
>                 URL: https://issues.apache.org/jira/browse/YARN-9596
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.8.0, 3.3.0
>            Reporter: Muhammad Samir Khan
>            Assignee: Muhammad Samir Khan
>            Priority: Major
>         Attachments: Screen Shot 2019-06-03 at 4.41.45 PM.png, Screen Shot 
> 2019-06-03 at 4.44.15 PM.png, YARN-9596-branch-3.0.004.patch, 
> YARN-9596.001.patch, YARN-9596.002.patch, YARN-9596.003.patch
>
>
> After YARN-6467, QueueMetrics should only be tracking metrics for the default 
> partition. However, the metrics are incorrect when labelled partitions are 
> involved.
> Steps to reproduce
> ==============
>  # Configure capacity-scheduler.xml with label configuration
>  # Add label "test" to cluster and replace label on node1 to be "test"
>  # Note down "totalMB" at 
> <resourcemanager.webapp.address:port>/ws/v1/cluster/metrics
>  # Start first job on test queue.
>  # Start second job on default queue (the problem does not reproduce if the 
> order of the two jobs is swapped).
>  # While the two applications are running, the "totalMB" at 
> <resourcemanager.webapp.address:port>/ws/v1/cluster/metrics will go down by 
> the amount of MB used by the first job (screenshots attached).
> Alternately:
> In 
> TestNodeLabelContainerAllocation.testQueueMetricsWithLabelsOnDefaultLabelNode(),
>  add the following lines at the end of the test before rm1.close():
> CSQueue rootQueue = cs.getRootQueue();
> assertEquals(10*GB,
>  rootQueue.getMetrics().getAvailableMB() + 
> rootQueue.getMetrics().getAllocatedMB());
> There are two nodes of 10GB each and only one of them has a non-default 
> label. The test will also fail against a 20*GB check.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
