Ferenc Erdelyi created YARN-11639:
-------------------------------------

             Summary: ConcurrentModificationException and NPE in 
PriorityUtilizationQueueOrderingPolicy
                 Key: YARN-11639
                 URL: https://issues.apache.org/jira/browse/YARN-11639
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Ferenc Erdelyi


When dynamic queue creation is enabled in weight mode and the deletion policy 
coincides with the PriorityQueueResourcesForSorting, RM stops assigning 
resources because of either ConcurrentModificationExceptionor NPE in 
PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java8 and Java11 environment:
{code:java}
... INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Removing queue: root.dyn.PmvkMgrEBQppu
2024-01-02 17:00:59,399 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-11,5,main] threw an Exception.
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
        at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
        at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
        at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1111)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Observed the ConcurrentModificationException in Java8 environment, but could 
not reproduce yet:
{code:java}
2023-10-27 02:50:37,584 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread Thread[Thread-15,5, 
main] threw an Exception.
java.util.ConcurrentModificationException
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtiliza
ueOrderingPolicy.Java:260)
{code}

The immediate (temporary) remedy to keep the cluster going is to restart the RM.
The workaround is to disable the deletion of dynamically created child queues. 






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org

Reply via email to