Benjamin Teke created YARN-11503:
------------------------------------
Summary: Adding queues separately in short succession with
Mutation API will stop CS allocating new containers
Key: YARN-11503
URL: https://issues.apache.org/jira/browse/YARN-11503
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler
Affects Versions: 3.4.0
Reporter: Benjamin Teke
Adding multiple queues in short succession via the Mutation API can trigger a race condition when registering the partition metrics for those queues, as shown by the following unhandled exception:
{code:java}
2023-05-09 18:25:36,301 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Initializing root.eca_m
2023-05-09 18:25:36,301 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager:
Initialized queue: root.eca_m
2023-05-09 18:25:36,359 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
LeafQueue:root.eca_mupdate max app related, maxApplications=1000,
maxApplicationsPerUser=1000, Abs Cap:0.0, Cap: 0.0, MaxCap : 1.0
2023-05-09 18:25:36,359 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
LeafQueue:root.eca_mupdate max app related, maxApplications=1000,
maxApplicationsPerUser=1000, Abs Cap:NaN, Cap: NaN, MaxCap : NaN
2023-05-09 18:25:36,401 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Initializing root.eca_m
2023-05-09 18:25:36,401 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager:
Initialized queue: root.eca_m
2023-05-09 18:25:36,484 ERROR
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread
Thread[Thread-26,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source
PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
2023-05-09 18:25:36,531 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Initializing root.eca_m
2023-05-09 18:25:36,531 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
root: re-configured queue: root.eca_m: capacity=0.0, absoluteCapacity=0.0,
usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0,
numApps=0, numContainers=0, effectiveMinResource=<memory:1152000, vCores:359> ,
effectiveMaxResource=<memory:2304000, vCores:718>
{code}
Initializing the leaf queue root.eca_m should only happen once during a reinit (twice if the validation endpoint is used), but in this case it happened three times within a quarter of a second. This results in an unhandled exception in the async scheduling thread, which then blocks new container allocation (existing containers can still transition to other states, however).
{code:java}
2023-05-09 18:25:36,484 ERROR
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread
Thread[Thread-26,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source
PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
at
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:355)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:614)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1545)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1198)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1109)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:927)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}
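The underlying failure mode can be sketched outside of YARN: a metrics system that throws when the same source name is registered twice will collide when two overlapping reinits of the same queue both try to register it. Below is a minimal sketch (all class and method names are illustrative, not YARN's actual code) of the failing register-twice pattern, alongside a race-safe get-or-register alternative in which lookup and registration happen atomically:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class MetricsRegistrySketch {
    private final Map<String, Object> sources = new ConcurrentHashMap<>();

    // Mimics the behavior seen in the log above: registering the same
    // source name a second time throws instead of returning the existing one.
    void register(String name, Object source) {
        if (sources.putIfAbsent(name, source) != null) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
    }

    // Race-safe alternative: computeIfAbsent performs the lookup and the
    // registration as one atomic step, so a concurrent second caller gets
    // the already-registered source back instead of an exception.
    Object getOrRegister(String name, Supplier<Object> factory) {
        return sources.computeIfAbsent(name, k -> factory.get());
    }
}
```

With `register`, whichever of two concurrent reinits loses the race throws; with `getOrRegister`, both callers observe the same source instance.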
Even though the Mutation API wasn't designed for this usage pattern, the scheduling thread shouldn't react to API calls like this.
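One hardening direction (a sketch only, not a proposed patch, with illustrative names) is to treat a failed scheduling pass as recoverable: if a single pass throws, log it and keep looping, so one transient error such as the MetricsException above cannot permanently stop container allocation:

```java
class ResilientScheduler {
    // Runs n scheduling passes; a pass that throws is skipped rather than
    // allowed to kill the loop, so later passes can still allocate.
    static int runPasses(int n, Runnable pass) {
        int completed = 0;
        for (int i = 0; i < n; i++) {
            try {
                pass.run();
            } catch (RuntimeException e) {
                // In real code this would be logged; the loop keeps going.
            }
            completed++;
        }
        return completed;
    }
}
```

The trade-off is that swallowing exceptions can hide a genuinely broken scheduler state, so in practice this would need to be paired with fixing the duplicate-registration race itself.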
--
This message was sent by Atlassian Jira
(v8.20.10#820010)