[ 
https://issues.apache.org/jira/browse/YARN-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11016:
----------------------------------
    Description: 
QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which 
can cause problems, as in the following scenario:
1. The queues are initialized.

2. Parent 'parent' has accessibleNodeLabels set, and since accessible node 
labels are inherited, its children (for example 'child') have the 'test' label 
as their accessible-node-label.

3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications, 
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each 
label (see getNodeLabelsForQueue). 
In this case, the labels are the accessible node labels (the inherited 'test'). 
During this call the ResourceUsage object is updated for the label 'test', 
which adds 'test' to its nodeLabelsSet.

4. In a subsequent updateClusterResource call, for example on an addNode event, 
ResourceUsage now contains the 'test' label even though it was never explicitly 
configured. We then call CSQueueUtils#updateQueueStatistics, which takes the 
union of the node labels from QueueCapacities and ResourceUsage (this union is 
now the empty default label AND 'test') and updates QueueCapacities for the 
label 'test'. 
Now QueueCapacities has 'test' in its nodeLabelsSet as well!

5. After a reinitialization (like an update from the mutation API), 
CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the 
QueueCapacities values to zero (even weight, which is wrong in my opinion) and 
loads the values again from the config. 
The problem here is that the values are reset for all node labels in 
QueueCapacities (even for 'test'), but we only load the values for the 
configured node labels (which we did not set, so they default to the empty 
label).

6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities, and 
that is why the update fails. 
This also explains why validation passes: the validation endpoint 
instantiates a brand new CapacityScheduler, in which this cascade of effects 
cannot accumulate (as there are no repeated updateClusterResource calls).
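
The steps above can be sketched with a small stand-alone model. This is only an 
illustration, not the Hadoop code: ToyQueue, its fields, the UNSET marker (-1) 
and the method bodies are simplified assumptions that mimic the behaviour 
described in steps 3 to 6 (label sets that grow on cluster updates, a reset 
that touches every known label, and a reload that only covers configured 
labels).

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the scenario; none of these types are the real YARN classes.
class WeightResetSketch {
    static final String DEFAULT_LABEL = ""; // the empty default label
    static final float UNSET = -1f;         // assumed "weight not configured" marker

    static class ToyQueue {
        // Stand-in for QueueCapacities: weight per node label.
        final Map<String, Float> weightByLabel = new HashMap<>();
        // Stand-in for the node label set tracked by ResourceUsage.
        final Set<String> usageLabels = new LinkedHashSet<>();
        final Set<String> accessibleLabels;
        final Map<String, Float> configuredWeights;

        ToyQueue(Set<String> accessibleLabels, Map<String, Float> configuredWeights) {
            this.accessibleLabels = accessibleLabels;
            this.configuredWeights = configuredWeights;
            loadCapacitiesFromConfig(); // step 1: initialization
        }

        float getWeight(String label) {
            return weightByLabel.getOrDefault(label, UNSET);
        }

        // Step 5: reset every known label to 0, then reload configured labels only.
        void loadCapacitiesFromConfig() {
            weightByLabel.replaceAll((label, weight) -> 0f);
            weightByLabel.putAll(configuredWeights);
        }

        // Step 3: usage is touched for the default and the inherited labels.
        void activateApplications() {
            usageLabels.add(DEFAULT_LABEL);
            usageLabels.addAll(accessibleLabels);
        }

        // Step 4: the union of both label sets ends up in the capacities.
        void updateQueueStatistics() {
            Set<String> union = new LinkedHashSet<>(weightByLabel.keySet());
            union.addAll(usageLabels);
            for (String label : union) {
                weightByLabel.putIfAbsent(label, UNSET);
            }
        }

        void updateClusterResource() {
            updateQueueStatistics();
            activateApplications();
        }
    }

    // Weight of 'test' on 'child' after N cluster updates and one reinit.
    static float weightAfterReinit(int clusterUpdates) {
        ToyQueue child = new ToyQueue(Set.of("test"), Map.of(DEFAULT_LABEL, 1f));
        for (int i = 0; i < clusterUpdates; i++) {
            child.updateClusterResource();
        }
        child.loadCapacitiesFromConfig(); // reinitialization
        return child.getWeight("test");
    }

    public static void main(String[] args) {
        System.out.println(weightAfterReinit(0)); // fresh instance: -1.0
        System.out.println(weightAfterReinit(2)); // long-running instance: 0.0
    }
}
```

In this model a fresh instance (as used by the validation endpoint) never sees 
'test' in its capacities, while an instance that has gone through two cluster 
updates ends up with weight 0 for 'test' after the reinit.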

This scenario manifests as an error when updating via mutation API:
{noformat}
Failed to re-init queues : Parent queue 'parent' have children queue used mixed 
of weight mode, percentage and absolute mode, it is not allowed, please double 
check, details:{noformat}


> Queue weight is incorrectly reset to zero
> -----------------------------------------
>
>                 Key: YARN-11016
>                 URL: https://issues.apache.org/jira/browse/YARN-11016
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Andras Gyori
>            Assignee: Andras Gyori
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
