[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518921#comment-17518921
 ] 

Imran Chaush commented on YARN-11016:
-------------------------------------

Thanks

> Queue weight is incorrectly reset to zero
> -----------------------------------------
>
>                 Key: YARN-11016
>                 URL: https://issues.apache.org/jira/browse/YARN-11016
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Andras Gyori
>            Assignee: Andras Gyori
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> QueueCapacities#clearConfigurableFields sets WEIGHT capacity to 0, which can 
> cause problems, as in the following scenario:
> 1. Initializing queues
> 2. Parent 'parent' has accessibleNodeLabels set, and since accessible node 
> labels are inherited, its children (for example 'child') have the 'test' 
> label as their accessible-node-label.
> 3. In LeafQueue#updateClusterResource, we call 
> LeafQueue#activateApplications, which then calls 
> LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see 
> getNodeLabelsForQueue). 
> In this case, the labels are the accessible node labels (the inherited 
> 'test'). 
> During this event, the ResourceUsage object is updated for the label 'test', 
> thus extending its nodeLabelsSet with 'test'.
> 4. In a following updateClusterResource call, for example an addNode event, 
> we now have the 'test' label in ResourceUsage even though it was never 
> explicitly configured. We then call CSQueueUtils#updateQueueStatistics, which 
> takes the union of the node labels from QueueCapacities and ResourceUsage 
> (this union is now the empty default label and 'test') and updates 
> QueueCapacities with the label 'test'. 
> Now QueueCapacities has 'test' in its nodeLabelsSet as well!
> 5. After a reinitialization (like an update from the mutation API), 
> CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the 
> QueueCapacities values to zero (even weight, which is wrong in my opinion) 
> and loads the values again from the config. 
> The problem here is that values are reset for all node labels in 
> QueueCapacities (even for 'test'), but we only load the values for the 
> configured node labels (which we did not set, so it is defaulted to the empty 
> label).
> 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities 
> and that is why the update fails. 
> It even explains why validation passes: the validation endpoint 
> instantiates a brand new CapacityScheduler, in which this cascade of effects 
> cannot accumulate (as there are no repeated updateClusterResource calls).
> This scenario manifests as an error when updating via mutation API:
> {noformat}
> Failed to re-init queues : Parent queue 'parent' have children queue used 
> mixed of weight mode, percentage and absolute mode, it is not allowed, please 
> double check, details:{noformat}
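
The reset-then-reload mismatch in steps 4 to 6 can be sketched in isolation.
This is a simplified illustration, not the YARN code: the class
WeightResetSketch, the method reinitialize, the map-based storage, and the -1
placeholder for a label that was never configured are all assumptions made
for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the behaviour described above; not a YARN class. */
public class WeightResetSketch {

    /** The empty string stands for the default (no-label) partition. */
    static final String DEFAULT_LABEL = "";

    /**
     * Mimics a reinitialization as described in step 5: every label that is
     * currently tracked is reset to 0, but values are reloaded only for the
     * labels present in the configuration.
     */
    static Map<String, Float> reinitialize(Map<String, Float> tracked,
                                           Map<String, Float> configured) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String label : tracked.keySet()) {
            result.put(label, 0f);   // reset applies to ALL tracked labels
        }
        result.putAll(configured);   // reload covers only configured labels
        return result;
    }

    public static void main(String[] args) {
        // Weights as loaded from the configuration: only the default label.
        Map<String, Float> configured = new LinkedHashMap<>();
        configured.put(DEFAULT_LABEL, 1f);

        // Tracked labels after step 4: 'test' leaked in via the label union.
        // The -1 value is a placeholder chosen for this sketch only.
        Map<String, Float> tracked = new LinkedHashMap<>(configured);
        tracked.put("test", -1f);

        Map<String, Float> after = reinitialize(tracked, configured);

        // 'test' was reset but never reloaded, so it is left with weight 0.
        System.out.println("default label weight: " + after.get(DEFAULT_LABEL));
        System.out.println("'test' label weight:  " + after.get("test"));
    }
}
```

In this sketch the default label gets its configured weight back, while the
'test' label, which exists only because of earlier bookkeeping, stays at 0.
That leftover value corresponds to the weight=0 state described in step 6.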



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
