Manikandan R created YARN-9767:
----------------------------------

             Summary: PartitionQueueMetrics Issues
                 Key: YARN-9767
                 URL: https://issues.apache.org/jira/browse/YARN-9767
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Manikandan R
            Assignee: Manikandan R


The intent of this Jira is to track separately, for ease of follow-up, the 
issues and observations encountered during YARN-6492 development.

Observations:

Please refer to this comment:

https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027

1. Partition info is extracted from both the request and the node, which 
causes a problem. For example: 
 
Node N is mapped to Label X (non-exclusive). Queue A is configured with the 
ANY node label. App A requests resources from Queue A, and its containers run 
on Node N for some reason. During the AbstractCSQueue#allocateResource call, 
the node partition (taken from SchedulerNode) is used for the calculation. 
Let's say the allocate call is fired for 3 containers of 1 GB each; then

a. PartitionDefault * queue A -> pending mb is 3 GB
b. PartitionX * queue A -> pending mb is -3 GB
 
is the outcome. The app request was fired without any label specification, 
so the pending increase was recorded against the default partition (#a). 
After allocation is over, pending resources are decreased, but the decrease 
uses the node's partition info, so it was recorded against partition X (#b). 
 
Given this situation, we will need to put some thought into how to compute 
these metrics correctly.
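
The mismatch in #a and #b can be reproduced with a minimal sketch. The class 
below is a hypothetical stand-in, not the actual QueueMetrics code: it assumes 
pending MB is kept in a map keyed by partition and queue, that the increment 
is keyed by the request's label, and that the decrement is keyed by the node's 
partition. The partition names "default" and "x" are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of observation 1; not the real YARN classes.
public class PendingMetricsSketch {
    // Pending MB per "partition|queue" key.
    private final Map<String, Long> pendingMB = new HashMap<>();

    private static String key(String partition, String queue) {
        return partition + "|" + queue;
    }

    // Request side: the key comes from the label on the app's request.
    public void incrPending(String requestPartition, String queue,
                            int containers, long mbEach) {
        pendingMB.merge(key(requestPartition, queue), containers * mbEach, Long::sum);
    }

    // Allocation side: the key comes from the partition of the node.
    public void decrPending(String nodePartition, String queue,
                            int containers, long mbEach) {
        pendingMB.merge(key(nodePartition, queue), -containers * mbEach, Long::sum);
    }

    public long getPendingMB(String partition, String queue) {
        return pendingMB.getOrDefault(key(partition, queue), 0L);
    }

    public static void main(String[] args) {
        PendingMetricsSketch metrics = new PendingMetricsSketch();
        // App A asks queue A for 3 x 1 GB without a label: default partition.
        metrics.incrPending("default", "A", 3, 1024);
        // Containers land on node N, which belongs to partition x.
        metrics.decrPending("x", "A", 3, 1024);
        System.out.println("default/A pending MB = " + metrics.getPendingMB("default", "A")); // 3072
        System.out.println("x/A pending MB = " + metrics.getPendingMB("x", "A"));             // -3072
    }
}
```

Because the two updates use different keys, neither entry returns to zero, 
which matches the 3 GB and -3 GB values listed above.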
 
2. Though the intent of this jira is to implement Partition Queue Metrics, we 
would like to retain the existing Queue Metrics for backward compatibility 
(as you can see from the jira's discussion).

With this patch and the YARN-9596 patch, the QueueMetrics for a queue would 
be overridden with either a specific partition's values or the default 
partition's values; it could go either way. For example, after a queue (say 
queue A) has been initialised with min and max capacity and also with a node 
label's min and max capacity, QueueMetrics (availableMB) for queue A returns 
values based on the node label's capacity config.
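
The overriding described above can be pictured with a small sketch. 
Everything here is hypothetical: the class, the method names, and the 
capacity numbers are assumptions for illustration, and the per-partition map 
is just one possible way to keep the values apart, not necessarily what 
.005.WIP.patch does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of observation 2; not the real QueueMetrics class.
public class AvailableMBSketch {
    // Problem shape: one availableMB gauge per queue, written by every partition.
    private final Map<String, Long> sharedGauge = new HashMap<>();
    // Alternative shape: one gauge per "partition|queue" key.
    private final Map<String, Long> partitionGauge = new HashMap<>();

    // Last writer wins, regardless of which partition the value belongs to.
    public void setAvailableShared(String partition, String queue, long mb) {
        sharedGauge.put(queue, mb);
    }

    // Each partition keeps its own value for the queue.
    public void setAvailablePerPartition(String partition, String queue, long mb) {
        partitionGauge.put(partition + "|" + queue, mb);
    }

    public long getShared(String queue) {
        return sharedGauge.getOrDefault(queue, 0L);
    }

    public long getPerPartition(String partition, String queue) {
        return partitionGauge.getOrDefault(partition + "|" + queue, 0L);
    }

    public static void main(String[] args) {
        AvailableMBSketch metrics = new AvailableMBSketch();
        // Queue A initialised from its default-partition capacity, then from label x.
        metrics.setAvailableShared("default", "A", 8192);
        metrics.setAvailableShared("x", "A", 2048);
        System.out.println("shared A = " + metrics.getShared("A")); // 2048, default value lost

        metrics.setAvailablePerPartition("default", "A", 8192);
        metrics.setAvailablePerPartition("x", "A", 2048);
        System.out.println("default/A = " + metrics.getPerPartition("default", "A")); // 8192
        System.out.println("x/A = " + metrics.getPerPartition("x", "A"));             // 2048
    }
}
```

With the shared gauge, whichever initialisation runs last decides what 
availableMB reports for queue A, which is the behaviour described above.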

I've been working on a fix for these observations and have attached 
.005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and 
availableVcores are correct (please refer to observation #2 above). I added 
more asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure the 
fix for #2 works properly.

One more thing to note: user metrics for availableMB and availableVcores at 
the root queue did not exist even before, and the same behaviour is retained. 
User metrics for availableMB and availableVcores are available only at the 
child queue level, and also with partitions.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
