Jeongin Ju created YARN-10892:
---------------------------------
Summary: YARN Preemption Monitor got
java.util.ConcurrentModificationException when three or more partitions exists
Key: YARN-10892
URL: https://issues.apache.org/jira/browse/YARN-10892
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.1.2
Reporter: Jeongin Ju
On our cluster, which has a large number of NMs, the preemption monitor thread
consistently throws java.util.ConcurrentModificationException when specific
conditions are met.
We found the following conditions. (All four must be met.)
# There are at least two non-exclusive partitions besides the default partition
(call them partitions X and Y).
# app1, in a queue belonging to the default partition (call it the 'dev'
queue), has borrowed resources from both the X and Y partitions.
# app2 and app3, submitted to queues belonging to the X and Y partitions
respectively, are 'PENDING' because the resources are consumed by app1.
# The preemption monitor can clear the borrowed resources of X or Y when a
container of app1 is preempted.
The main problem is that FifoCandidatesSelector.selectCandidates tries to remove
a HashMap key (the partition name) while iterating over that HashMap.
Logically this is correct, because the same partition is not traversed again
within this 'selectCandidates' call. However, HashMap does not allow structural
modification while it is being iterated, other than through the iterator itself.
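For illustration, here is a minimal standalone sketch of this failure mode. It is not the YARN code and not necessarily what the attached patch does: the class name, method names, and map contents are hypothetical stand-ins for the per-partition map. It shows the unsafe pattern (removing a key inside a for-each loop over keySet()) and two common safe alternatives, Iterator.remove() and iterating over a snapshot of the keys.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;

public class PartitionRemovalSketch {

    // Hypothetical stand-in for a map keyed by partition name.
    // The empty string plays the role of the default partition.
    static Map<String, Integer> newPartitionMap() {
        Map<String, Integer> m = new HashMap<>();
        m.put("", 4);
        m.put("X", 2);
        m.put("Y", 3);
        return m;
    }

    // Unsafe: removes a key through the map while a for-each loop
    // over keySet() is still running. Returns true if the loop threw.
    static boolean unsafeRemoveThrows() {
        Map<String, Integer> m = newPartitionMap();
        try {
            for (String partition : m.keySet()) {
                m.remove(partition);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe option 1: remove through the iterator itself.
    static Map<String, Integer> safeRemoveViaIterator() {
        Map<String, Integer> m = newPartitionMap();
        Iterator<String> it = m.keySet().iterator();
        while (it.hasNext()) {
            String partition = it.next();
            if (!partition.isEmpty()) {
                it.remove();
            }
        }
        return m;
    }

    // Safe option 2: iterate over a copy of the keys and remove
    // from the original map.
    static Map<String, Integer> safeRemoveViaSnapshot() {
        Map<String, Integer> m = newPartitionMap();
        for (String partition : new HashSet<>(m.keySet())) {
            if (!partition.isEmpty()) {
                m.remove(partition);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println("unsafe loop threw CME: " + unsafeRemoveThrows());
        System.out.println("after Iterator.remove: " + safeRemoveViaIterator().keySet());
        System.out.println("after snapshot loop:   " + safeRemoveViaSnapshot().keySet());
    }
}
```

Either safe variant keeps the "do not visit this partition again" intent. The snapshot variant is the smaller change when the removal happens deep inside the loop body, at the cost of one extra copy of the key set per call.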
I wrote a test case that reproduces the error
(testResourceTypesInterQueuePreemptionWithThreePartitions).
We found and patched this on our 3.1.2 cluster, but trunk appears to have the
same problem.
I have attached a patch based on trunk.
Thanks!
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)}}
{quote}