[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103138#comment-17103138
 ] 

Manikandan R edited comment on YARN-6492 at 5/9/20, 7:03 AM:
-------------------------------------------------------------

Thanks [~jhung] and [~epayne] for your support.

I would like to clear some things at high level, especially on 
scope/requirements of this Jira and revisit the background (thought process) on 
how all these related Jira has been created to ensure that we are on same page:

YARN-6467 computes metrics only for default partition. It has been created as 
an interim step towards the major goal of "Providing Metrics at Partition 
Level" to the customers. Major goal is nothing but this JIRA. Since YARN-6467 
is stepping stone for this JIRA, it has been coded in a way that it should 
easily accomodate this Jira changes in a simplistic way (For example,  Just 
removing if(partition is default) check inside each metric computation method 
expected to take care most of the things and no more changes required on collar 
side).

Though YARN-6467 covers some aspects, it had created confusion (for the queue's 
associated with multiple partitions) as well. Original QueueMetrics computation 
behaviour has been changed. Original QueueMetrics computation is nothing but 
the metrics computation only from Queue perspective irrespective of how many 
partitions it has been associated to and nothing to do with Partitions.  It 
started providing metrics only for "default" partition by replacing the 
original behaviour. Another reasons for taking up this path is, this Jira 
expected to go into the trunk immediately after YARN-6467 (as planned :) ) and 
hence there won't be any inconsistency in original queue metrics computation 
behaviour, but it didn't happened.

So, whenever we said "backward compatibility" we referred to this Original 
QueueMetrics computation, not "existing QueueMetrics should still only contain 
metrics for default partition".  In other words, Original QueueMetrics 
computation is nothing but the code/behaviour before YARN-6467.

Now, let me explain scope of this JIRA. We would like to achieve the following 
things:
 # Partition * Queue Metrics:

A partition can be associated with many queues. So we need to break up, hence 
we need Partition * Queue metrics.

Proposed structure is

PartitionMetric (labelX)
     QueueMetric (A)
        metrics
        Usermetrics 
        QueueMetric (A1)
            metrics
            Usermetrics
        QueueMetric (A2)
            metrics
            Usermetrics
     QueueMetric (B)
        metrics
        Usermetrics
 PartitionMetric (labelY)
     QueueMetric (A)
          QueueMetric (A1) 
          QueueMetric (A2)
     QueueMetric (B)
           …

{{QueueMetrics#getPartitionQueueMetrics }} takes care of this registration into 
Metric system and use this object for all metric computations.

Sample JMX o/p is 
{code:java}
"name" : 
"Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x,q0=root,q1=a"
 ...{code}
 

 

2. Partition metrics:

Partition level metrics computation. This can help Admins to analyse the usage 
at Partition level.

Proposed structure is

PartitionMetric (labelX)
    metrics
 PartitionMetric (labelY)
    metrics


 {{PartitionQueueMetrics#getPartitionQueueMetrics }} takes care of this 
registration into Metric system and use this object for all metric computations.

Sample JMX o/p is 
{code:java}
"name" : 
"Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x" 
...{code}
In addition to these 2 changes, we would like to retain the Original 
QueueMetrics computation behaviour.

Hope the above explanation explains why the below assert has been changed:
{noformat}
assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}
is changed to
{noformat}
assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}
This assert has been added as part of YARN-9596 to ensure YARN-6467 works 
correctly. YARN-9767 exhaustive unit test changes explain the difference 
between Partition * Queue Metrics, Partition Metrics and Original QueueMetrics 
very clearly.

What changes this patch should contain?

Yes, there is some confusion as some changes are in YARN-9767. #2 described in 
YARN-9767 should be in this patch. Otherwise, this patch is incomplete from 
feature rollout perspective. [~jhung] said #1 described in YARN-9767 was there 
even before. My understanding, it should be happening after YARN-6467 only. 
Would it be better if we handle that too here? What do you think?

Please share your opinions. Post that, will post the proper patch first and 
then reviews can be taken up on the same.

 

In order to be consistent with other API responses like 
{{/ws/v1/cluster/scheduler}}, I think this should just be an empty string. So, 
I would expect the JMX response to look like the following for 
DEFAULT_PARTITION:\{quote}

 

Yes [~epayne], we should handle this for sure.


was (Author: maniraj...@gmail.com):
Thanks [~jhung] and [~epayne] for your support.

I would like to clear some things at high level, especially on 
scope/requirements of this Jira and revisit the background (thought process) on 
how all these related Jira has been created to ensure that we are on same page:

YARN-6467 computes metrics only for default partition. It has been created as 
an interim step towards the major goal of "Providing Metrics at Partition 
Level" to the customers. Major goal is nothing but this JIRA. Since YARN-6467 
is stepping stone for this JIRA, it has been coded in a way that it should 
easily accomodate this Jira changes in a simplistic way (For example,  Just 
removing if(partition is default) check inside each metric computation method 
expected to take care most of the things and no more changes required on collar 
side).

Though YARN-6467 covers some aspects, it had created confusion (for the queue's 
associated with multiple partitions) as well. Original QueueMetrics computation 
behaviour has been changed. Original QueueMetrics computation is nothing but 
the metrics computation only from Queue perspective irrespective of how many 
partitions it has been associated to and nothing to do with Partitions.  It 
started providing metrics only for "default" partition by replacing the 
original behaviour. Another reasons for taking up this path is, this Jira 
expected to go into the trunk immediately after YARN-6467 (as planned :) ) and 
hence there won't be any inconsistency in original queue metrics computation 
behaviour, but it didn't happened.

So, whenever we said "backward compatibility" we referred to this Original 
QueueMetrics computation, not "existing QueueMetrics should still only contain 
metrics for default partition".  In other words, Original QueueMetrics 
computation is nothing but the code/behaviour before YARN-6467.

Now, let me explain scope of this JIRA. We would like to achieve the following 
things:
 # Partition * Queue Metrics:

A partition can be associated with many queues. So we need to break up, hence 
we need Partition * Queue metrics.

Proposed structure is


 PartitionMetric (labelX)
    QueueMetric (A)
       metrics
       Usermetrics 
       QueueMetric (A1)
           metrics
           Usermetrics
       QueueMetric (A2)
           metrics
           Usermetrics
    QueueMetric (B)
       metrics
       Usermetrics
 PartitionMetric (labelY)
    QueueMetric (A)
         QueueMetric (A1) 
         QueueMetric (A2)
    QueueMetric (B)
          …

{{QueueMetrics#getPartitionQueueMetrics }} takes care of this registration into 
Metric system and use this object for all metric computations.

Sample JMX o/p is 
{code:java}
"name" : 
"Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x,q0=root,q1=a"
 ...{code}
 

 

2. Partition metrics:

Partition level metrics computation. This can help Admins to analyse the usage 
at Partition level.

Proposed structure is
 PartitionMetric (labelX)
 metrics
 PartitionMetric (labelY)
 metrics
 {{PartitionQueueMetrics#getPartitionQueueMetrics }} takes care of this 
registration into Metric system and use this object for all metric computations.

Sample JMX o/p is 
{code:java}
"name" : 
"Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x" 
...{code}
In addition to these 2 changes, we would like to retain the Original 
QueueMetrics computation behaviour.

Hope the above explanation explains why the below assert has been changed:
{noformat}
assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}
is changed to
{noformat}
assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat}
This assert has been added as part of YARN-9596 to ensure YARN-6467 works 
correctly. YARN-9767 exhaustive unit test changes explain the difference 
between Partition * Queue Metrics, Partition Metrics and Original QueueMetrics 
very clearly.

What changes this patch should contain?

Yes, there is some confusion as some changes are in YARN-9767. #2 described in 
YARN-9767 should be in this patch. Otherwise, this patch is incomplete from 
feature rollout perspective. [~jhung] said #1 described in YARN-9767 was there 
even before. My understanding, it should be happening after YARN-6467 only. 
Would it be better if we handle that too here? What do you think?

Please share your opinions. Post that, will post the proper patch first and 
then reviews can be taken up on the same.

 

In order to be consistent with other API responses like 
{{/ws/v1/cluster/scheduler}}, I think this should just be an empty string. So, 
I would expect the JMX response to look like the following for 
DEFAULT_PARTITION:\{quote}

 

Yes [~epayne], we should handle this for sure.

> Generate queue metrics for each partition
> -----------------------------------------
>
>                 Key: YARN-6492
>                 URL: https://issues.apache.org/jira/browse/YARN-6492
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>            Reporter: Jonathan Hung
>            Assignee: Manikandan R
>            Priority: Major
>         Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to