Jeongin Ju created YARN-10892:
---------------------------------

             Summary: YARN Preemption Monitor got 
java.util.ConcurrentModificationException when three or more partitions exists
                 Key: YARN-10892
                 URL: https://issues.apache.org/jira/browse/YARN-10892
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.1.2
            Reporter: Jeongin Ju


On our cluster with a large number of NMs, the preemption monitor thread 
consistently hit java.util.ConcurrentModificationException whenever a specific 
set of conditions was met.

The conditions we identified are as follows (all four must hold):
 # There are at least two non-exclusive partitions besides the default partition 
(call them partitions X and Y).
 # app1, running in a queue that belongs to the default partition (call it the 
'dev' queue), has borrowed resources from both X and Y.
 # app2 and app3, submitted to queues belonging to X and Y respectively, are 
'PENDING' because app1 is consuming the resources.
 # Preempting a container of app1 lets the preemption monitor clear the 
borrowed resources from X or Y.

The main problem is that FifoCandidatesSelector.selectCandidates tries to remove 
a HashMap key (the partition name) while iterating over that HashMap.

Logically, this is correct, because the same partition is not traversed again 
within this 'selectCandidates' call. However, HashMap does not allow structural 
modification while it is being iterated.
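
The failure mode can be sketched in isolation. The example below is not the attached patch and does not use any YARN classes; the class name, the method names and the map contents are hypothetical stand-ins for the per-partition map. It only shows that removing a key through the map while a keySet() iterator is live raises ConcurrentModificationException, and that removing through the iterator itself is one way to avoid it.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PartitionRemovalSketch {

    // Hypothetical stand-in for a map keyed by partition name.
    // The empty string represents the default partition.
    static Map<String, Integer> sampleMap() {
        Map<String, Integer> m = new HashMap<>();
        m.put("", 4);
        m.put("X", 2);
        m.put("Y", 3);
        return m;
    }

    // Unsafe pattern: Map.remove() is called while a keySet() iterator is
    // still in use. The next call to the iterator's next() detects the
    // structural change and throws ConcurrentModificationException.
    static boolean unsafeRemovalThrows() {
        Map<String, Integer> m = sampleMap();
        try {
            for (String partition : m.keySet()) {
                m.remove(partition);
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    // One safe alternative: remove through the iterator, which keeps the
    // iterator's view of the map consistent.
    static Map<String, Integer> safeRemoval() {
        Map<String, Integer> m = sampleMap();
        Iterator<Map.Entry<String, Integer>> it = m.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            if (!entry.getKey().isEmpty()) {
                it.remove(); // drop the non-default partitions
            }
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println("unsafe removal throws: " + unsafeRemovalThrows());
        System.out.println("keys left after safe removal: " + safeRemoval().keySet());
    }
}
```

Another option, under the same assumptions, is to iterate over a copy of the key set (for example new HashSet<>(map.keySet())) and remove from the original map.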

I wrote a test case (testResourceTypesInterQueuePreemptionWithThreePartitions) 
that reproduces the error.

We found this on our 3.1.2 cluster and patched it there, but trunk still seems 
to have the same problem.

I have attached a patch based on trunk.

 

Thanks!

 
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)}}

{quote}



