Benjamin Teke created YARN-11503:
------------------------------------

             Summary: Adding queues separately in short succession with 
Mutation API will stop CS allocating new containers
                 Key: YARN-11503
                 URL: https://issues.apache.org/jira/browse/YARN-11503
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
    Affects Versions: 3.4.0
            Reporter: Benjamin Teke


Adding multiple queues in short succession via the Mutation API can trigger a 
race condition when registering the partition metrics for those queues, as 
shown by the unhandled exception:

{code:java}
2023-05-09 18:25:36,301 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Initializing root.eca_m
2023-05-09 18:25:36,301 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager:
 Initialized queue: root.eca_m
2023-05-09 18:25:36,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 LeafQueue:root.eca_mupdate max app related, maxApplications=1000, 
maxApplicationsPerUser=1000, Abs Cap:0.0, Cap: 0.0, MaxCap : 1.0
2023-05-09 18:25:36,359 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue:
 LeafQueue:root.eca_mupdate max app related, maxApplications=1000, 
maxApplicationsPerUser=1000, Abs Cap:NaN, Cap: NaN, MaxCap : NaN
2023-05-09 18:25:36,401 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Initializing root.eca_m
2023-05-09 18:25:36,401 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager:
 Initialized queue: root.eca_m
2023-05-09 18:25:36,484 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-26,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source 
PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
2023-05-09 18:25:36,531 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
Initializing root.eca_m
2023-05-09 18:25:36,531 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
root: re-configured queue: root.eca_m: capacity=0.0, absoluteCapacity=0.0, 
usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, 
numApps=0, numContainers=0, effectiveMinResource=<memory:1152000, vCores:359> , 
effectiveMaxResource=<memory:2304000, vCores:718>
{code}

Initializing the leaf queue root.eca_m should only happen once during a reinit 
(twice if the validation endpoint is used), but in this case it happened three 
times within a quarter of a second. This results in an unhandled exception in 
the async scheduling thread, which then blocks new container allocation 
(existing containers can still transition to other states, however).
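To illustrate the failure mode, here is a minimal, hypothetical sketch of the check-then-register pattern (this is not Hadoop's actual DefaultMetricsSystem code, and the class and method names are invented for illustration). Two reinitializations running close together can both observe the source as absent, so the slower one hits the "already exists" path; making the lookup-or-create atomic avoids the race.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a metrics registry with a racy check-then-act
// registration, plus an idempotent alternative.
public class MetricsRegistrySketch {
    private final Map<String, Object> sources = new HashMap<>();

    // Non-atomic check-then-act: two threads can both pass the
    // containsKey check before either has inserted the entry, so the
    // second put-er sees "already exists" (analogous to the
    // MetricsException in the stack trace above).
    public Object registerUnsafe(String name, Object source) {
        if (sources.containsKey(name)) {
            throw new IllegalStateException(
                "Metrics source " + name + " already exists!");
        }
        sources.put(name, source);
        return source;
    }

    // One possible fix: make lookup-or-create a single atomic step, so a
    // duplicate registration returns the existing source instead of throwing.
    public synchronized Object registerIdempotent(String name, Object source) {
        return sources.computeIfAbsent(name, k -> source);
    }
}
```

With `registerIdempotent`, a second reinit of the same queue simply reuses the existing metrics source, so the async scheduling thread never sees the exception.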

{code:java}
2023-05-09 18:25:36,484 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-26,5,main] threw an Exception.
org.apache.hadoop.metrics2.MetricsException: Metrics source 
PartitionQueueMetrics,partition=,q0=root,q1=eca_m already exists!
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:355)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:614)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1545)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1198)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1109)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:927)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Even though the Mutation API wasn't designed for this usage pattern, the 
scheduling thread shouldn't break in response to API calls.
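One way to harden the thread, sketched below, is to catch unexpected runtime exceptions per scheduling iteration instead of letting them propagate to the uncaught-exception handler (this is a hypothetical illustration, not the actual CapacityScheduler AsyncScheduleThread code; `Step` and `runLoop` are invented names).

```java
// Sketch: a scheduling loop that survives a failing iteration. In the bug
// described above, the uncaught MetricsException terminates the async thread,
// which silently stops all new container allocation; catching per iteration
// keeps the loop alive.
public class GuardedScheduleLoop {
    public interface Step {
        void schedule();
    }

    // Returns the number of iterations that completed without throwing.
    public static int runLoop(Step step, int iterations) {
        int succeeded = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                step.schedule();
                succeeded++;
            } catch (RuntimeException e) {
                // Log and continue: one bad iteration should not stop
                // scheduling for the whole cluster.
                System.err.println("Scheduling iteration failed: "
                    + e.getMessage());
            }
        }
        return succeeded;
    }
}
```

The root-cause fix is still to make the metrics registration race-free; the guard only limits the blast radius of any similar unexpected exception.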




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
