[
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Manikandan R resolved YARN-9767.
--------------------------------
Resolution: Fixed
> PartitionQueueMetrics Issues
> ----------------------------
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Manikandan R
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of the Jira is to capture the issues/observations encountered as
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer to this comment:
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Because partition info is extracted from both the request and the node,
> the two can disagree. For example:
>
> Node N is mapped to Label X (non-exclusive). Queue A is configured with the
> ANY node label. App A requested resources from Queue A and its containers
> ran on Node N for some reason. During the
> AbstractCSQueue#allocateResource call, the node partition (taken from
> SchedulerNode) is used for the calculation. Say the allocate call is fired
> for 3 containers of 1 GB each; then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>
> is the outcome. #a arises because the app request was made without any
> label specification, so pending resources are counted against the default
> partition. Once allocation is over, pending resources are decreased, but
> that step uses the node's partition info, which produces #b.
>
> Given this situation, we need to put some thought into how to compute the
> metrics correctly.
>
> 2. Though the intent of this Jira is to add Partition Queue Metrics, we
> would like to retain the existing Queue Metrics for backward compatibility
> (as you can see from the Jira's discussion).
> With this patch and the YARN-9596 patch, QueueMetrics (for queues) would be
> overridden either with some specific partition's values or with the default
> partition's values. It could be vice versa as well. For example, after a
> queue (say queue A) has been initialised with some min and max capacity and
> also with a node label's min and max capacity, QueueMetrics (availableMB)
> for queue A returns values based on the node label's capacity config.
> I've been working on these observations to provide a fix and have attached
> .005.WIP.patch. The focus of .005.WIP.patch is to ensure that availableMB
> and availableVcores are correct (please refer to observation #2 above).
> Added more asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to
> ensure the fix for #2 works properly.
> One more thing to note: user metrics for availableMB and availableVcores at
> the root queue were not present even before this change, and that behaviour
> is retained. User metrics for availableMB and availableVcores are available
> only at the child queue level, and also per partition.
>
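The mismatch described in observation 1 can be reproduced in a few lines. The sketch below is not YARN code: the class PendingMetricsSketch, its methods, and the "partition/queue" key format are hypothetical stand-ins for per-partition queue metrics. It assumes only what the description states, namely that pending MB is incremented under the request's partition and decremented under the node's partition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hadoop YARN.
public class PendingMetricsSketch {
    // "partition/queue" -> pending MB
    private final Map<String, Long> pendingMb = new HashMap<>();

    void incrPending(String partition, String queue, long mb) {
        pendingMb.merge(partition + "/" + queue, mb, Long::sum);
    }

    void decrPending(String partition, String queue, long mb) {
        pendingMb.merge(partition + "/" + queue, -mb, Long::sum);
    }

    static Map<String, Long> simulate() {
        PendingMetricsSketch metrics = new PendingMetricsSketch();
        String requestPartition = "default"; // App A asked for no label
        String nodePartition = "X";          // Node N carries non-exclusive label X

        // The request is recorded under the request's partition...
        for (int i = 0; i < 3; i++) {
            metrics.incrPending(requestPartition, "A", 1024);
        }
        // ...but the allocation is subtracted under the node's partition.
        for (int i = 0; i < 3; i++) {
            metrics.decrPending(nodePartition, "A", 1024);
        }
        return metrics.pendingMb;
    }

    public static void main(String[] args) {
        // default/A ends at 3072 MB and X/A at -3072 MB (cases a and b).
        System.out.println(simulate());
    }
}
```

Because the increment and the decrement use different keys, neither entry returns to zero. Any fix would need both sides to agree on one partition for a given container, though which one to use is the open question raised above.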