[ https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648 ]
Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:57 PM: ---------------------------------------------------------------- Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments: * We can change parentQueue in QueueMetrics.java to be Queue instead of AbstractCSQueue (to fix test cases) * Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + userName", can we change this to "partition + '.' + userName + '.' + queuePath" ? In particular the queuePath + userName part could cause conflicts (e.g. queue named "root.auser" could conflict with user metrics under queue "root.a" and username "user"). I see a few places for this: # PartitionQueueMetrics#constructor#parentMetricName # PartitionQueueMetrics#getUserMetrics#metricName # QueueMetrics#getUserMetrics#metricName # QueueMetrics#getPartitionQueueMetrics#metricName # Key for QueueMetrics#getPartitionMetrics could collide if the partition name is "root" * In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add the metrics object to QUEUE_METRICS, since we're accessing user metrics via the user map (and not the QUEUE_METRICS map) * In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add queue path to the key, since the users map is not static * QueueMetrics#queueSource method does not seem to be used anywhere, can we delete it? * How come we need a CSQueueMetrics#forQueue implementation? It looks the same as QueueMetrics#forQueue * We shouldn't add capacity scheduler specific things in QueueInfo, are these changes needed? * For partition metrics, I don't think setAvailableResourcesToQueue is handled correctly. It appears to update partition metrics no matter which queue this method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics: {noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 2);{noformat} should be {noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 2);{noformat} Perhaps we should only update partition metrics in setAvailableResourcesToQueue if the queue is root? * Delete {noformat}System.out.println(" final is " + parentQueueSource_X.toString());{noformat} * Same in TestQueueMetrics, there should not be capacity scheduler specific logic here, can we remove these changes? * On line 2539 of TestNodeLabelContainerAllocation, should {noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} be {noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} ? * Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The stuff removed seems to be non-exclusive node label functionality (default partition node heartbeating, and checking queue metrics are correct), so we probably want to keep these tests. * On line 2566, how is node1 getting 8 containers if queue A's max capacity is only 50% of 10GB = 5GB? was (Author: jhung): Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments: * We can change parentQueue in QueueMetrics.java to be Queue instead of AbstractCSQueue (to fix test cases) * Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + userName", can we change this to "partition + '.' + userName + '.' + queuePath" ? In particular the queuePath + userName part could cause conflicts (e.g. queue named "root.auser" could conflict with user metrics under queue "root.a" and username "user"). I see a few places for this: # PartitionQueueMetrics#constructor#parentMetricName # PartitionQueueMetrics#getUserMetrics#metricName # QueueMetrics#getUserMetrics#metricName # QueueMetrics#getPartitionQueueMetrics#metricName # Key for QueueMetrics#getPartitionMetrics could collide if the partition name is "root" * In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add the metrics object to QUEUE_METRICS, since we're accessing user metrics via the user map (and not the QUEUE_METRICS map) * In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I don't think we need to add queue path to the key, since the users map is not static * QueueMetrics#queueSource method does not seem to be used anywhere, can we delete it? * How come we need a CSQueueMetrics#forQueue implementation? It looks the same as QueueMetrics#forQueue * We shouldn't add capacity scheduler specific things in QueueInfo, are these changes needed? * For partition metrics, I don't think setAvailableResourcesToQueue is handled correctly. It appears to update partition metrics no matter which queue this method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics: {noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 2);{noformat} should be {noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 2);{noformat} Perhaps we should only update partition metrics in setAvailableResourcesToQueue if the queue is root? * Delete {noformat}println System.out.println(" final is " + parentQueueSource_X.toString());{noformat} * Same in TestQueueMetrics, there should not be capacity scheduler specific logic here, can we remove these changes? * On line 2539 of TestNodeLabelContainerAllocation, should {noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} be {noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), delta);{noformat} ? * Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The stuff removed seems to be non-exclusive node label functionality (default partition node heartbeating, and checking queue metrics are correct), so we probably want to keep these tests. * On line 2566, how is node1 getting 8 containers if queue A's max capacity is only 50% of 10GB = 5GB? > Generate queue metrics for each partition > ----------------------------------------- > > Key: YARN-6492 > URL: https://issues.apache.org/jira/browse/YARN-6492 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler > Reporter: Jonathan Hung > Assignee: Manikandan R > Priority: Major > Attachments: PartitionQueueMetrics_default_partition.txt, > PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, > YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, > YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, > YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, > partition_metrics.txt > > > We are interested in having queue metrics for all partitions. Right now each > queue has one QueueMetrics object which captures metrics either in default > partition or across all partitions. (After YARN-6467 it will be in default > partition) > But having the partition metrics would be very useful. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org