[ 
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
 ] 

Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:59 PM:
----------------------------------------------------------------

Thank you [[email protected]]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). Putting the user before the queue and adding the delimiter 
should prevent the user from being interpreted as part of the queue path. I see 
a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the {{users}} map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the {{users}} map is 
not static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?


was (Author: jhung):
Thank you [[email protected]]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of 
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath + 
userName", can we change this to "partition + '.' + userName + '.' + queuePath" 
? In particular the queuePath + userName part could cause conflicts (e.g. queue 
named "root.auser" could conflict with user metrics under queue "root.a" and 
username "user"). Putting the user before the queue and adding the delimiter 
should prevent the user from being interpreted as part of the queue path. I see 
a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name 
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add the metrics object to QUEUE_METRICS, since we're 
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I 
don't think we need to add queue path to the key, since the users map is not 
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we 
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same 
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these 
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled 
correctly. It appears to update partition metrics no matter which queue this 
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2, 
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2, 
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue 
if the queue is root?
* Delete {noformat}System.out.println(" final is " + 
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific 
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
be 
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(), 
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The 
stuff removed seems to be non-exclusive node label functionality (default 
partition node heartbeating, and checking queue metrics are correct), so we 
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is 
only 50% of 10GB = 5GB?

> Generate queue metrics for each partition
> -----------------------------------------
>
>                 Key: YARN-6492
>                 URL: https://issues.apache.org/jira/browse/YARN-6492
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>            Reporter: Jonathan Hung
>            Assignee: Manikandan R
>            Priority: Major
>         Attachments: PartitionQueueMetrics_default_partition.txt, 
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt, 
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch, 
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch, 
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch, 
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each 
> queue has one QueueMetrics object which captures metrics either in default 
> partition or across all partitions. (After YARN-6467 it will be in default 
> partition)
> But having the partition metrics would be very useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to