[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues

2020-06-01 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123337#comment-17123337
 ] 

Manikandan R commented on YARN-9767:


YARN-6492 patch covered this fixes too. Hence closing this.

> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of the Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer this 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info are being extracted from request and node, there is a 
> problem. For example, 
>  
> Node N has been mapped to Label X (Non exclusive). Queue A has been 
> configured with ANY Node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reasons. During 
> AbstractCSQueue#allocateResource call, Node partition (using SchedulerNode ) 
> would get used for calculation. Lets say allocate call has been fired for 3 
> containers of 1 GB each, then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. Because app request has been fired without any label 
> specification and #a metrics has been derived. After allocation is over, 
> pending resources usually gets decreased. When this happens, it use node 
> partition info. hence #b metrics has derived. 
>  
> Given this kind of situation, We will need to put some thoughts on achieving 
> the metrics correctly.
>  
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from jira's discussion).
> With this patch and YARN-9596 patch, queuemetrics (for queue's) would be 
> overridden either with some specific partition values or default partition 
> values. It could be vice - versa as well. For example, after the queues (say 
> queue A) has been initialised with some min and max cap and also with node 
> label's min and max cap, Queuemetrics (availableMB) for queue A return values 
> based on node label's cap config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. Focus of .005.WIP.patch is to ensure availableMB, 
> availableVcores is correct (Please refer above #2 observation). Added more 
> asserts in{{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure fix for 
> #2 is working properly.
> Also one more thing to note is, user metrics for availableMB, availableVcores 
> at root queue was not there even before. Retained the same behaviour. User 
> metrics for availableMB, availableVcores is available only at child queue's 
> level and also with partitions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues

2020-01-09 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012004#comment-17012004
 ] 

Manikandan R commented on YARN-9767:


Sorry for the delay. Rebased the patch. It should work now. Also it requires 
YARN-6492.007.WIP.patch.

Regarding {{LeafQueue#getheadroom}}, made changes for some reason, will need to 
recollect. I think to address -ve metrics values.

However, Did a quick revisit and run junits without those changes expecting it 
to be run without failures. Only 1 assert is failing.

assertEquals(2 * GB, queueAUserMetrics.getAvailableMB());

in 
TestNodeLabelContainerAllocation#testQueueMetricsWithLabelsOnDefaultLabelNode. 
Expecting it to be -1. Since userLimitResource value is derived based on the 
partition exclusivity property, hence allowing the user to go beyond the max 
capacity and finally leading to this situation. Will dig deeper and update.

> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of the Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer this 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info are being extracted from request and node, there is a 
> problem. For example, 
>  
> Node N has been mapped to Label X (Non exclusive). Queue A has been 
> configured with ANY Node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reasons. During 
> AbstractCSQueue#allocateResource call, Node partition (using SchedulerNode ) 
> would get used for calculation. Lets say allocate call has been fired for 3 
> containers of 1 GB each, then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. Because app request has been fired without any label 
> specification and #a metrics has been derived. After allocation is over, 
> pending resources usually gets decreased. When this happens, it use node 
> partition info. hence #b metrics has derived. 
>  
> Given this kind of situation, We will need to put some thoughts on achieving 
> the metrics correctly.
>  
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from jira's discussion).
> With this patch and YARN-9596 patch, queuemetrics (for queue's) would be 
> overridden either with some specific partition values or default partition 
> values. It could be vice - versa as well. For example, after the queues (say 
> queue A) has been initialised with some min and max cap and also with node 
> label's min and max cap, Queuemetrics (availableMB) for queue A return values 
> based on node label's cap config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. Focus of .005.WIP.patch is to ensure availableMB, 
> availableVcores is correct (Please refer above #2 observation). Added more 
> asserts in{{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure fix for 
> #2 is working properly.
> Also one more thing to note is, user metrics for availableMB, availableVcores 
> at root queue was not there even before. Retained the same behaviour. User 
> metrics for availableMB, availableVcores is available only at child queue's 
> level and also with partitions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues

2019-12-19 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000205#comment-17000205
 ] 

Eric Payne commented on YARN-9767:
--

[~maniraj...@gmail.com], Thanks for raising this issue and providing a patch.

I am very sorry for my very late reply. Unfortunately, due to my long delay, 
this patch no longer applies.

I am still trying to get my head around the use case. However, looking at the 
code, I'm very concerned that changes related to queue metrics would change 
behavior of {{LeafQueue#getheadroom}}, which is integral to the way the 
capacity scheduler assigns resources. I will need to spend a lot of time 
understanding the use case and the problems this JIRA is solving.


> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of the Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer this 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info are being extracted from request and node, there is a 
> problem. For example, 
>  
> Node N has been mapped to Label X (Non exclusive). Queue A has been 
> configured with ANY Node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reasons. During 
> AbstractCSQueue#allocateResource call, Node partition (using SchedulerNode ) 
> would get used for calculation. Lets say allocate call has been fired for 3 
> containers of 1 GB each, then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. Because app request has been fired without any label 
> specification and #a metrics has been derived. After allocation is over, 
> pending resources usually gets decreased. When this happens, it use node 
> partition info. hence #b metrics has derived. 
>  
> Given this kind of situation, We will need to put some thoughts on achieving 
> the metrics correctly.
>  
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from jira's discussion).
> With this patch and YARN-9596 patch, queuemetrics (for queue's) would be 
> overridden either with some specific partition values or default partition 
> values. It could be vice - versa as well. For example, after the queues (say 
> queue A) has been initialised with some min and max cap and also with node 
> label's min and max cap, Queuemetrics (availableMB) for queue A return values 
> based on node label's cap config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. Focus of .005.WIP.patch is to ensure availableMB, 
> availableVcores is correct (Please refer above #2 observation). Added more 
> asserts in{{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure fix for 
> #2 is working properly.
> Also one more thing to note is, user metrics for availableMB, availableVcores 
> at root queue was not there even before. Retained the same behaviour. User 
> metrics for availableMB, availableVcores is available only at child queue's 
> level and also with partitions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues

2019-08-29 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918755#comment-16918755
 ] 

Manikandan R commented on YARN-9767:


Attaching .001.patch for review.

> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-9767.001.patch
>
>
> The intent of the Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer this 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info are being extracted from request and node, there is a 
> problem. For example, 
>  
> Node N has been mapped to Label X (Non exclusive). Queue A has been 
> configured with ANY Node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reasons. During 
> AbstractCSQueue#allocateResource call, Node partition (using SchedulerNode ) 
> would get used for calculation. Lets say allocate call has been fired for 3 
> containers of 1 GB each, then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. Because app request has been fired without any label 
> specification and #a metrics has been derived. After allocation is over, 
> pending resources usually gets decreased. When this happens, it use node 
> partition info. hence #b metrics has derived. 
>  
> Given this kind of situation, We will need to put some thoughts on achieving 
> the metrics correctly.
>  
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from jira's discussion).
> With this patch and YARN-9596 patch, queuemetrics (for queue's) would be 
> overridden either with some specific partition values or default partition 
> values. It could be vice - versa as well. For example, after the queues (say 
> queue A) has been initialised with some min and max cap and also with node 
> label's min and max cap, Queuemetrics (availableMB) for queue A return values 
> based on node label's cap config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. Focus of .005.WIP.patch is to ensure availableMB, 
> availableVcores is correct (Please refer above #2 observation). Added more 
> asserts in{{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure fix for 
> #2 is working properly.
> Also one more thing to note is, user metrics for availableMB, availableVcores 
> at root queue was not there even before. Retained the same behaviour. User 
> metrics for availableMB, availableVcores is available only at child queue's 
> level and also with partitions.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9767) PartitionQueueMetrics Issues

2019-08-20 Thread Manikandan R (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911575#comment-16911575
 ] 

Manikandan R commented on YARN-9767:


On #1 observation,

After container allocation, pending resources gets deducted inside 
{{QueueeMetrics#allocateResources}} using the Node Partition as opposed to 
requested partition info. I think RMContainerImpl#getNodeLabelExpression can be 
used to decreasing pending resources as it is more appropriate because of 
following reasons:

1. {{RMContainerImpl#getNodeLabelExpression}} is derived from 
{{AppPlacementAllocator#getPrimaryRequestedNodePartition}}. Java doc of 
{{AppPlacementAllocator#getPrimaryRequestedNodePartition}} is good enough to 
explain this.
 2. In this case, actual intent is to run on ANY where (which is nothing but 
the "default" partition) but ended up in using some non exclusive partition. So 
increasing pending resources on "default" partition or 
PrimaryRequestedNodePartition (mostly "default" or any specific partition) and 
deducting the pending resources in the same way seems to be correct one rather 
than increasing and decreasing in two different places.

So fix would be something like

{{AppSchedulingInfo#updateMetrics}}
{code:java}
   queue.getMetrics().allocateResources(node.getPartition(), user, 1,
  containerAllocated.getContainer().getResource(), false);
  queue.getMetrics().decrPendingResources(
  containerAllocated.getNodeLabelExpression(), user, 1,
  containerAllocated.getContainer().getResource());
{code}
instead of
{code:java}
   queue.getMetrics().allocateResources(node.getPartition(), user, 1,
  containerAllocated.getContainer().getResource(), true);
{code}

Please share your thoughts.

> PartitionQueueMetrics Issues
> 
>
> Key: YARN-9767
> URL: https://issues.apache.org/jira/browse/YARN-9767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> The intent of the Jira is to capture the issues/observations encountered as 
> part of YARN-6492 development separately for ease of tracking.
> Observations:
> Please refer this 
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info are being extracted from request and node, there is a 
> problem. For example, 
>  
> Node N has been mapped to Label X (Non exclusive). Queue A has been 
> configured with ANY Node label. App A requested resources from Queue A and 
> its containers ran on Node N for some reasons. During 
> AbstractCSQueue#allocateResource call, Node partition (using SchedulerNode ) 
> would get used for calculation. Lets say allocate call has been fired for 3 
> containers of 1 GB each, then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
>  
> is the outcome. Because app request has been fired without any label 
> specification and #a metrics has been derived. After allocation is over, 
> pending resources usually gets decreased. When this happens, it use node 
> partition info. hence #b metrics has derived. 
>  
> Given this kind of situation, We will need to put some thoughts on achieving 
> the metrics correctly.
>  
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would 
> like to retain the existing Queue Metrics for backward compatibility (as you 
> can see from jira's discussion).
> With this patch and YARN-9596 patch, queuemetrics (for queue's) would be 
> overridden either with some specific partition values or default partition 
> values. It could be vice - versa as well. For example, after the queues (say 
> queue A) has been initialised with some min and max cap and also with node 
> label's min and max cap, Queuemetrics (availableMB) for queue A return values 
> based on node label's cap config.
> I've been working on these observations to provide a fix and attached 
> .005.WIP.patch. Focus of .005.WIP.patch is to ensure availableMB, 
> availableVcores is correct (Please refer above #2 observation). Added more 
> asserts in{{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure fix for 
> #2 is working properly.
> Also one more thing to note is, user metrics for availableMB, availableVcores 
> at root queue was not there even before. Retained the same behaviour. User 
> metrics for availableMB, availableVcores is available only at child queue's 
> level and also with partitions.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h..