[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:57 PM:
----------------------------------------------------------------

Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?


was (Author: jhung):
Thank you [~maniraj...@gmail.com]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). I see a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}println System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?

> Generate queue metrics for each partition
> -----------------------------------------
>
>                 Key: YARN-6492
>                 URL: https://issues.apache.org/jira/browse/YARN-6492
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>            Reporter: Jonathan Hung
>            Assignee: Manikandan R
>            Priority: Major
>         Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to