[ https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454697#comment-17454697 ]
Szilard Nemeth commented on YARN-11016:
---------------------------------------
Hi [~gandras],
Just committed your patch to trunk.
Could you please check whether it's required to backport this to branch-3.3 /
branch-3.2?
Thanks.
> Queue weight is incorrectly reset to zero
> -----------------------------------------
>
> Key: YARN-11016
> URL: https://issues.apache.org/jira/browse/YARN-11016
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Reporter: Andras Gyori
> Assignee: Andras Gyori
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which
> can cause problems such as in the following scenario:
> 1. Initializing queues
> 2. Parent 'parent' has accessibleNodeLabels set, and since accessible node
> labels are inherited, its children (for example 'child') have the 'test'
> label as their accessible-node-label.
> 3. In LeafQueue#updateClusterResource, we call
> LeafQueue#activateApplications, which then calls
> LeafQueue#calculateAndGetAMResourceLimitPerPartition for each label (see
> getNodeLabelsForQueue).
> In this case, the labels are the accessible node labels (the inherited
> 'test').
> During this event, the ResourceUsage object is updated for the label 'test',
> thus extending its nodeLabelsSet with 'test'.
> 4. In a following updateClusterResource call, for example an addNode event,
> the 'test' label is now present in ResourceUsage even though it was never
> explicitly configured. We then call CSQueueUtils#updateQueueStatistics, which
> takes the union of the node labels from QueueCapacities and ResourceUsage
> (this union is now the empty default label and 'test') and updates
> QueueCapacities for the label 'test'.
> Now QueueCapacities has 'test' in its nodeLabelsSet as well!
> 5. After a reinitialization (like an update from mutation API), the
> CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the
> QueueCapacities values to zero (even weight, which is wrong in my opinion)
> and loads the values again from the config.
> The problem here is that values are reset for all node labels in
> QueueCapacities (even for 'test'), but we only load the values for the
> configured node labels (which we did not set, so it is defaulted to the empty
> label).
> 6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities
> and that is why the update fails.
> This also explains why validation passes: the validation endpoint
> instantiates a brand new CapacityScheduler, in which this cascade of effects
> cannot accumulate (as there are no repeated updateClusterResource calls).
> This scenario manifests as an error when updating via the mutation API:
> {noformat}
> Failed to re-init queues : Parent queue 'parent' have children queue used
> mixed of weight mode, percentage and absolute mode, it is not allowed, please
> double check, details:{noformat}
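
For anyone following the scenario in the description, steps 3-6 can be sketched with a small, self-contained Java model. This is only an illustration under stated assumptions, not the Hadoop code and not a description of the committed patch: ToyCapacities, simulate, the -1 "weight not set" sentinel and the configured weight of 1 are all hypothetical stand-ins for QueueCapacities and the surrounding CapacityScheduler logic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WeightResetSketch {
    /** Hypothetical stand-in for QueueCapacities: one weight per node label. */
    static class ToyCapacities {
        // Assumption of this sketch: -1 means "weight not configured".
        static final float UNSET = -1f;
        final Map<String, Float> weightByLabel = new HashMap<>();

        float getWeight(String label) {
            return weightByLabel.getOrDefault(label, UNSET);
        }

        void setWeight(String label, float weight) {
            weightByLabel.put(label, weight);
        }

        /** Resets the weight of EVERY known label, configured or not. */
        void clearConfigurableFields(float resetValue) {
            weightByLabel.replaceAll((label, weight) -> resetValue);
        }
    }

    /** Replays steps 3-6; resetValue is what the clear step writes. */
    static ToyCapacities simulate(float resetValue) {
        // Only the empty default label is explicitly configured.
        Set<String> configuredLabels = new HashSet<>();
        configuredLabels.add("");

        ToyCapacities capacities = new ToyCapacities();
        capacities.setWeight("", 1f); // hypothetical configured weight

        // Step 3: the inherited label 'test' leaks into resource usage.
        Set<String> usageLabels = new HashSet<>();
        usageLabels.add("test");

        // Step 4: statistics update walks the union of both label sets,
        // which registers 'test' in the capacities as well.
        Set<String> union = new HashSet<>(capacities.weightByLabel.keySet());
        union.addAll(usageLabels);
        for (String label : union) {
            capacities.weightByLabel.putIfAbsent(label, ToyCapacities.UNSET);
        }

        // Step 5: re-init clears all labels, then reloads only configured ones.
        capacities.clearConfigurableFields(resetValue);
        for (String label : configuredLabels) {
            capacities.setWeight(label, 1f);
        }
        return capacities;
    }

    public static void main(String[] args) {
        System.out.println("reset to 0:  weight('test') = "
            + simulate(0f).getWeight("test"));
        System.out.println("reset to -1: weight('test') = "
            + simulate(ToyCapacities.UNSET).getWeight("test"));
    }
}
```

In this toy model, resetting to 0 leaves the never-configured 'test' label with an explicit weight of 0 (step 6), while resetting to the sentinel leaves it unset. The second call is only there for contrast within the sketch.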
--
This message was sent by Atlassian Jira
(v8.20.1#820001)