Andras Gyori created YARN-11016:
-----------------------------------
Summary: Queue weight is incorrectly reset to zero
Key: YARN-11016
URL: https://issues.apache.org/jira/browse/YARN-11016
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori
QueueCapacities#clearConfigurableFields sets the WEIGHT capacity to 0, which
can cause problems such as in the following scenario:
1. Initializing queues
2. Parent 'parent' has accessibleNodeLabels set and, since accessible node
labels are inherited, its children (for example 'child') have the 'test' label
as their accessible-node-label.
3. In LeafQueue#updateClusterResource, we call LeafQueue#activateApplications,
which then calls LeafQueue#calculateAndGetAMResourceLimitPerPartition for each
label (see getNodeLabelsForQueue). In this case, the labels are the accessible
node labels (the inherited 'test'). During this call the ResourceUsage object
is updated for the label 'test', thus extending its nodeLabelsSet with 'test'.
4. In a following updateClusterResource call, for example an addNode event, we
now have the 'test' label in ResourceUsage even though it was never explicitly
configured. We then call CSQueueUtils#updateQueueStatistics, which takes the
union of the node labels from QueueCapacities and ResourceUsage (this union is
now the empty default label AND 'test') and updates QueueCapacities for the
label 'test'. Now QueueCapacities has 'test' in its nodeLabelsSet as well!
5. After a reinitialization (like an update from the mutation API),
CSQueueUtils#loadCapacitiesByLabelsFromCon is called, which resets the
QueueCapacities values to zero (even weight, which is wrong in my opinion) and
loads the values again from config. The problem here is that values are reset
for all node labels in QueueCapacities (even for 'test'), but we only load the
values for the configured node labels (which we did not set, so it is defaulted
to the empty label).
6. Now all children of 'parent' have weight=0 for 'test' in QueueCapacities and
that is why the update fails. It even explains why validation passes: the
validation endpoint instantiates a brand new CapacityScheduler, in which this
cascade of effects cannot accumulate (as there are no multiple
updateClusterResource calls).
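
The cascade in steps 1-6 can be sketched with a small, self-contained model.
This is only an illustration under stated assumptions: Capacities and Usage are
hypothetical stand-ins for QueueCapacities and ResourceUsage, the UNSET
sentinel and the configured weight of 1 are made up for the example, and none
of this is the actual Hadoop code.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of the scenario; NOT the real Hadoop classes.
class QueueWeightResetSketch {

    static final String NO_LABEL = "";  // the empty default label
    static final float UNSET = -1f;     // assumed "not configured" marker

    // Stand-in for QueueCapacities: one weight per known node label.
    static class Capacities {
        final Map<String, Float> weightByLabel = new LinkedHashMap<>();

        // Models the reported behaviour: weight is zeroed for EVERY known label.
        void clearConfigurableFields() {
            weightByLabel.replaceAll((label, weight) -> 0f);
        }
    }

    // Stand-in for ResourceUsage: only tracks which labels have been touched.
    static class Usage {
        final Set<String> labels = new HashSet<>();
    }

    // Models steps 1 and 5: reset everything, then reload configured labels only.
    static void loadFromConfig(Capacities caps, Map<String, Float> configured) {
        caps.clearConfigurableFields();
        caps.weightByLabel.putAll(configured);
    }

    // Models step 4: the union of both label sets ends up in Capacities.
    static void updateQueueStatistics(Capacities caps, Usage usage) {
        Set<String> union = new HashSet<>(caps.weightByLabel.keySet());
        union.addAll(usage.labels);
        for (String label : union) {
            caps.weightByLabel.putIfAbsent(label, UNSET);
        }
    }

    static Map<String, Float> simulate() {
        // Only the default label is configured (assumed weight of 1).
        Map<String, Float> configured = Collections.singletonMap(NO_LABEL, 1f);
        Capacities caps = new Capacities();
        Usage usage = new Usage();

        loadFromConfig(caps, configured);   // step 1: initialize queues
        usage.labels.add("test");           // step 3: AM limit touches inherited label
        updateQueueStatistics(caps, usage); // step 4: 'test' leaks into Capacities
        loadFromConfig(caps, configured);   // step 5: reinitialization
        return caps.weightByLabel;          // step 6: 'test' is left at weight 0
    }

    public static void main(String[] args) {
        simulate().forEach((label, weight) ->
            System.out.println("label='" + label + "' weight=" + weight));
    }
}
```

In this model the reload restores the weight of the default label, while the
never-configured 'test' label keeps the zero written by the reset. A freshly
constructed model that only runs the first loadFromConfig never sees 'test' at
all, which matches the observation that validation passes.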
--
This message was sent by Atlassian Jira
(v8.20.1#820001)