[
https://issues.apache.org/jira/browse/YARN-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112648#comment-17112648
]
Jonathan Hung edited comment on YARN-6492 at 5/20/20, 10:59 PM:
----------------------------------------------------------------
Thank you [[email protected]]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath +
userName", can we change this to "partition + '.' + userName + '.' + queuePath"
? In particular the queuePath + userName part could cause conflicts (e.g. queue
named "root.auser" could conflict with user metrics under queue "root.a" and
username "user"). Putting the user before the queue and adding the delimiter
should prevent the user from being interpreted as part of the queue path. I see
a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I
don't think we need to add the metrics object to QUEUE_METRICS, since we're
accessing user metrics via the {{users}} map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I
don't think we need to add queue path to the key, since the {{users}} map is
not static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled
correctly. It appears to update partition metrics no matter which queue this
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2,
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2,
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue
if the queue is root?
* Delete {noformat}System.out.println(" final is " +
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(),
delta);{noformat}
be
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(),
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The
stuff removed seems to be non-exclusive node label functionality (default
partition node heartbeating, and checking queue metrics are correct), so we
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is
only 50% of 10GB = 5GB?
was (Author: jhung):
Thank you [[email protected]]. Looks fine at a high level. A few comments:
* We can change parentQueue in QueueMetrics.java to be Queue instead of
AbstractCSQueue (to fix test cases)
* Right now we're concatenating QUEUE_METRICS keys as "partition + queuePath +
userName", can we change this to "partition + '.' + userName + '.' + queuePath"
? In particular the queuePath + userName part could cause conflicts (e.g. queue
named "root.auser" could conflict with user metrics under queue "root.a" and
username "user"). Putting the user before the queue and adding the delimiter
should prevent the user from being interpreted as part of the queue path. I see
a few places for this:
# PartitionQueueMetrics#constructor#parentMetricName
# PartitionQueueMetrics#getUserMetrics#metricName
# QueueMetrics#getUserMetrics#metricName
# QueueMetrics#getPartitionQueueMetrics#metricName
# Key for QueueMetrics#getPartitionMetrics could collide if the partition name
is "root"
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I
don't think we need to add the metrics object to QUEUE_METRICS, since we're
accessing user metrics via the user map (and not the QUEUE_METRICS map)
* In QueueMetrics#getUserMetrics and PartitionQueueMetrics#getUserMetrics, I
don't think we need to add queue path to the key, since the users map is not
static
* QueueMetrics#queueSource method does not seem to be used anywhere, can we
delete it?
* How come we need a CSQueueMetrics#forQueue implementation? It looks the same
as QueueMetrics#forQueue
* We shouldn't add capacity scheduler specific things in QueueInfo, are these
changes needed?
* For partition metrics, I don't think setAvailableResourcesToQueue is handled
correctly. It appears to update partition metrics no matter which queue this
method is invoked for. Thus for example on line 87 of TestPartitionQueueMetrics:
{noformat}checkResources(partitionSource, 0, 0, 0, 100 * GB, 100, 2 * GB, 2,
2);{noformat}
should be
{noformat}checkResources(partitionSource, 0, 0, 0, 200 * GB, 200, 2 * GB, 2,
2);{noformat}
Perhaps we should only update partition metrics in setAvailableResourcesToQueue
if the queue is root?
* Delete {noformat}System.out.println(" final is " +
parentQueueSource_X.toString());{noformat}
* Same in TestQueueMetrics, there should not be capacity scheduler specific
logic here, can we remove these changes?
* On line 2539 of TestNodeLabelContainerAllocation, should
{noformat}assertEquals(2 * GB, queueAUserMetrics.getAvailableMB(),
delta);{noformat}
be
{noformat}assertEquals(1.5 * GB, queueAUserMetrics.getAvailableMB(),
delta);{noformat}
?
* Do we need the tests after line 2551 on TestNodeLabelContainerAllocation? The
stuff removed seems to be non-exclusive node label functionality (default
partition node heartbeating, and checking queue metrics are correct), so we
probably want to keep these tests.
* On line 2566, how is node1 getting 8 containers if queue A's max capacity is
only 50% of 10GB = 5GB?
> Generate queue metrics for each partition
> -----------------------------------------
>
> Key: YARN-6492
> URL: https://issues.apache.org/jira/browse/YARN-6492
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler
> Reporter: Jonathan Hung
> Assignee: Manikandan R
> Priority: Major
> Attachments: PartitionQueueMetrics_default_partition.txt,
> PartitionQueueMetrics_x_partition.txt, PartitionQueueMetrics_y_partition.txt,
> YARN-6492.001.patch, YARN-6492.002.patch, YARN-6492.003.patch,
> YARN-6492.004.patch, YARN-6492.005.WIP.patch, YARN-6492.006.WIP.patch,
> YARN-6492.007.WIP.patch, YARN-6492.008.WIP.patch, YARN-6492.009.WIP.patch,
> partition_metrics.txt
>
>
> We are interested in having queue metrics for all partitions. Right now each
> queue has one QueueMetrics object which captures metrics either in default
> partition or across all partitions. (After YARN-6467 it will be in default
> partition)
> But having the partition metrics would be very useful.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]