[
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373412#comment-17373412
]
yanbin.zhang commented on YARN-10178:
-
If you turn off the preemption function and solve the problem
?[~zhuqi][~wangda][~tuyu]
> Global Scheduler async thread crash caused by 'Comparison method violates its
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch,
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread,
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException:
> Comparison method violates its general contract!
>at
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> JAVA 8 Arrays.sort default use timsort algo, and timsort has few require
> {code:java}
> 1.x.compareTo(y) != y.compareTo(x)
> 2.x>y,y>z --> x > z
> 3.x=y, x.compareTo(z) == y.compareTo(z)
> {code}
> if not Arrays paramters not satify this require,TimSort will throw
> 'java.lang.IllegalArgumentException'
> look at PriorityUtilizationQueueOrderingPolicy.compare function,we will know
> Capacity Scheduler use this these queue resource usage to compare
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In Capacity Scheduler Global Scheduler AsyncThread use
> PriorityUtilizationQueueOrderingPolicy function to choose queue to assign
> container,and construct a CSAssignment struct, and use
> submitResourceCommitRequest function add CSAssignment to backlogs
> ResourceCommitterService will tryCommit this CSAssignment,look tryCommit
> function,there will update queue resource usage
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
> long commitStart = System.nanoTime();
> ResourceCommitRequest request =
> (ResourceCommitRequest) r;
>
> ...
> boolean isSuccess = false;
> if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
> if (app.accept(cluster, request,