[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103138#comment-17103138 ]
Manikandan R edited comment on YARN-6492 at 5/9/20, 7:03 AM: ------------------------------------------------------------- Thanks [~jhung] and [~epayne] for your support. I would like to clear some things at high level, especially on scope/requirements of this Jira and revisit the background (thought process) on how all these related Jira has been created to ensure that we are on same page: YARN-6467 computes metrics only for default partition. It has been created as an interim step towards the major goal of "Providing Metrics at Partition Level" to the customers. Major goal is nothing but this JIRA. Since YARN-6467 is stepping stone for this JIRA, it has been coded in a way that it should easily accomodate this Jira changes in a simplistic way (For example, Just removing if(partition is default) check inside each metric computation method expected to take care most of the things and no more changes required on collar side). Though YARN-6467 covers some aspects, it had created confusion (for the queue's associated with multiple partitions) as well. Original QueueMetrics computation behaviour has been changed. Original QueueMetrics computation is nothing but the metrics computation only from Queue perspective irrespective of how many partitions it has been associated to and nothing to do with Partitions. It started providing metrics only for "default" partition by replacing the original behaviour. Another reasons for taking up this path is, this Jira expected to go into the trunk immediately after YARN-6467 (as planned :) ) and hence there won't be any inconsistency in original queue metrics computation behaviour, but it didn't happened. So, whenever we said "backward compatibility" we referred to this Original QueueMetrics computation, not "existing QueueMetrics should still only contain metrics for default partition". In other words, Original QueueMetrics computation is nothing but the code/behaviour before YARN-6467. Now, let me explain scope of this JIRA. We would like to achieve the following things: # Partition * Queue Metrics: A partition can be associated with many queues. So we need to break up, hence we need Partition * Queue metrics. Proposed structure is PartitionMetric (labelX) QueueMetric (A) metrics Usermetrics QueueMetric (A1) metrics Usermetrics QueueMetric (A2) metrics Usermetrics QueueMetric (B) metrics Usermetrics PartitionMetric (labelY) QueueMetric (A) QueueMetric (A1) QueueMetric (A2) QueueMetric (B) … {{QueueMetrics#getPartitionQueueMetrics }} takes care of this registration into Metric system and use this object for all metric computations. Sample JMX o/p is {code:java} "name" : "Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x,q0=root,q1=a" ...{code} 2. Partition metrics: Partition level metrics computation. This can help Admins to analyse the usage at Partition level. Proposed structure is PartitionMetric (labelX) metrics PartitionMetric (labelY) metrics {{PartitionQueueMetrics#getPartitionQueueMetrics }} takes care of this registration into Metric system and use this object for all metric computations. Sample JMX o/p is {code:java} "name" : "Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x" ...{code} In addition to these 2 changes, we would like to retain the Original QueueMetrics computation behaviour. Hope the above explanation explains why the below assert has been changed: {noformat} assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat} assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} This assert has been added as part of YARN-9596 to ensure YARN-6467 works correctly. YARN-9767 exhaustive unit test changes explain the difference between Partition * Queue Metrics, Partition Metrics and Original QueueMetrics very clearly. What changes this patch should contain? Yes, there is some confusion as some changes are in YARN-9767. #2 described in YARN-9767 should be in this patch. Otherwise, this patch is incomplete from feature rollout perspective. [~jhung] said #1 described in YARN-9767 was there even before. My understanding, it should be happening after YARN-6467 only. Would it be better if we handle that too here? What do you think? Please share your opinions. Post that, will post the proper patch first and then reviews can be taken up on the same. In order to be consistent with other API responses like {{/ws/v1/cluster/scheduler}}, I think this should just be an empty string. So, I would expect the JMX response to look like the following for DEFAULT_PARTITION:\{quote} Yes [~epayne], we should handle this for sure. was (Author: maniraj...@gmail.com): Thanks [~jhung] and [~epayne] for your support. I would like to clear some things at high level, especially on scope/requirements of this Jira and revisit the background (thought process) on how all these related Jira has been created to ensure that we are on same page: YARN-6467 computes metrics only for default partition. It has been created as an interim step towards the major goal of "Providing Metrics at Partition Level" to the customers. Major goal is nothing but this JIRA. Since YARN-6467 is stepping stone for this JIRA, it has been coded in a way that it should easily accomodate this Jira changes in a simplistic way (For example, Just removing if(partition is default) check inside each metric computation method expected to take care most of the things and no more changes required on collar side). Though YARN-6467 covers some aspects, it had created confusion (for the queue's associated with multiple partitions) as well. Original QueueMetrics computation behaviour has been changed. Original QueueMetrics computation is nothing but the metrics computation only from Queue perspective irrespective of how many partitions it has been associated to and nothing to do with Partitions. It started providing metrics only for "default" partition by replacing the original behaviour. Another reasons for taking up this path is, this Jira expected to go into the trunk immediately after YARN-6467 (as planned :) ) and hence there won't be any inconsistency in original queue metrics computation behaviour, but it didn't happened. So, whenever we said "backward compatibility" we referred to this Original QueueMetrics computation, not "existing QueueMetrics should still only contain metrics for default partition". In other words, Original QueueMetrics computation is nothing but the code/behaviour before YARN-6467. Now, let me explain scope of this JIRA. We would like to achieve the following things: # Partition * Queue Metrics: A partition can be associated with many queues. So we need to break up, hence we need Partition * Queue metrics. Proposed structure is PartitionMetric (labelX) QueueMetric (A) metrics Usermetrics QueueMetric (A1) metrics Usermetrics QueueMetric (A2) metrics Usermetrics QueueMetric (B) metrics Usermetrics PartitionMetric (labelY) QueueMetric (A) QueueMetric (A1) QueueMetric (A2) QueueMetric (B) … {{QueueMetrics#getPartitionQueueMetrics }} takes care of this registration into Metric system and use this object for all metric computations. Sample JMX o/p is {code:java} "name" : "Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x,q0=root,q1=a" ...{code} 2. Partition metrics: Partition level metrics computation. This can help Admins to analyse the usage at Partition level. Proposed structure is PartitionMetric (labelX) metrics PartitionMetric (labelY) metrics {{PartitionQueueMetrics#getPartitionQueueMetrics }} takes care of this registration into Metric system and use this object for all metric computations. Sample JMX o/p is {code:java} "name" : "Hadoop:service=ResourceManager,name=PartitionQueueMetrics,partition=x" ...{code} In addition to these 2 changes, we would like to retain the Original QueueMetrics computation behaviour. Hope the above explanation explains why the below assert has been changed: {noformat} assertEquals(10 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} is changed to {noformat} assertEquals(22 * GB, leafQueueA.getMetrics().getAvailableMB());{noformat} This assert has been added as part of YARN-9596 to ensure YARN-6467 works correctly. YARN-9767 exhaustive unit test changes explain the difference between Partition * Queue Metrics, Partition Metrics and Original QueueMetrics very clearly. What changes this patch should contain? Yes, there is some confusion as some changes are in YARN-9767. #2 described in YARN-9767 should be in this patch. Otherwise, this patch is incomplete from feature rollout perspective. [~jhung] said #1 described in YARN-9767 was there even before. My understanding, it should be happening after YARN-6467 only. Would it be better if we handle that too here? What do you think? Please share your opinions. Post that, will post the proper patch first and then reviews can be taken up on the same. In order to be consistent with other API responses like {{/ws/v1/cluster/scheduler}}, I think this should just be an empty string. So, I would expect the JMX response to look like the following for DEFAULT_PARTITION:\{quote} Yes [~epayne], we should handle this for sure. > Generate queue metrics for each partition > ----------------------------------------- > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler > Reporter: Jonathan Hung > Assignee: Manikandan R > Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org