[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szilard Nemeth updated YARN-10178:
----------------------------------
Description:
Stack trace:
{code:java}
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
{code}
In JDK 8, Arrays.sort on object arrays (and therefore Collections.sort) uses
the TimSort algorithm by default, and TimSort relies on the comparison method
honoring its general contract:
{code:java}
1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
2. x > y, y > z --> x > z
3. x.compareTo(y) == 0 --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
{code}
If the comparison results seen during a sort break any of these rules, TimSort
may throw a java.lang.IllegalArgumentException.
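As an illustration of the comparator contract (sign symmetry, transitivity, and consistency of equal elements), the following minimal sketch checks the rules by brute force over a small sample. The class name ComparatorContractCheck and the method satisfiesContract are hypothetical helpers invented for this example; they are not part of the JDK or of Hadoop.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: brute-force check of the comparator contract. */
final class ComparatorContractCheck {

  /** Returns true if cmp satisfies all three rules for every triple in sample. */
  static <T> boolean satisfiesContract(Comparator<T> cmp, List<T> sample) {
    for (T x : sample) {
      for (T y : sample) {
        int xy = Integer.signum(cmp.compare(x, y));
        // Rule 1: swapping the arguments must flip the sign.
        if (xy != -Integer.signum(cmp.compare(y, x))) {
          return false;
        }
        for (T z : sample) {
          int yz = Integer.signum(cmp.compare(y, z));
          int xz = Integer.signum(cmp.compare(x, z));
          // Rule 2: x > y and y > z must imply x > z.
          if (xy > 0 && yz > 0 && xz <= 0) {
            return false;
          }
          // Rule 3: if x equals y, both must compare the same way against z.
          if (xy == 0 && xz != yz) {
            return false;
          }
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Integer> sample = Arrays.asList(1, 2, 3, 4);
    Comparator<Integer> natural = Comparator.naturalOrder();
    // Never returns a positive value, so rule 1 fails for any two distinct elements.
    Comparator<Integer> broken = (a, b) -> a.equals(b) ? 0 : -1;
    System.out.println(satisfiesContract(natural, sample)); // prints true
    System.out.println(satisfiesContract(broken, sample));  // prints false
  }
}
```

A check like this only covers the sample it is given. It cannot detect violations caused by values that change while a sort is running.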
1. If we take a look into the PriorityUtilizationQueueOrderingPolicy.compare
method, we can see that Capacity Scheduler uses these queue fields in order to
compare resource usage:
{code:java}
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
{code}
2. In CS, this IllegalArgumentException is thrown on the AsyncScheduleThread
while PriorityUtilizationQueueOrderingPolicy is sorting the queues to choose
the queue to assign the container to.
3. ResourceCommitterService tries to commit the CSAssignment coming from a
ResourceCommitRequest. In the tryCommit function, the queue resource usage is
updated:
{code:java}
public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
    boolean updatePending) {
  long commitStart = System.nanoTime();
  ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
      (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
  ...
  boolean isSuccess = false;
  if (attemptId != null) {
    FiCaSchedulerApp app = getApplicationAttempt(attemptId);
    // Required sanity check for attemptId - when async-scheduling enabled,
    // proposal might be outdated if AM failover just finished
    // and proposal queue was not be consumed in time
    if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
      if (app.accept(cluster, request, updatePending)
          && app.apply(cluster, request, updatePending)) { // apply this resource
        ...
      }
    }
  }
  return isSuccess;
}
{code}
{code:java}
public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
    FiCaSchedulerNode> request, boolean updatePending) {
  ...
  if (!reReservation) {
    getCSLeafQueue().apply(cluster, request);
  }
  ...
}
{code}
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply
invokes org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource:
{code:java}
void allocateResource(Resource clusterResource,
    Resource resource, String nodePartition) {
  try {
    writeLock.lock(); // only the leaf queue lock is taken
    queueUsage.incUsed(nodePartition, resource);
    ++numContainers;
    // the queue statistics are updated here
    CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
        this, labelManager, nodePartition);
  } finally {
    writeLock.unlock();
  }
}
{code}
5. We can see that ResourceCommitterService only locks the Leaf Queue to
update the queue statistics, but the AsyncScheduleThread only takes the read
lock of the Root Queue (in ParentQueue#sortAndGetChildrenAllocationIterator):
{code:java}
private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
    String partition) {
  try {
    readLock.lock();
    return queueOrderingPolicy.getAssignmentIterator(partition);
  } finally {
    readLock.unlock();
  }
}
{code}
So if multiple threads are comparing queue usage statistics while
ResourceCommitterService concurrently applies changes to the Leaf Queue
statistics, the values seen by the comparator can change in the middle of a
sort. This breaks the TimSort algorithm's requirements and causes the thread
crash.
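The race described above can be sketched in isolation. The example below is a simplified model, not Capacity Scheduler code: DemoQueue, QueueSnapshot, LIVE and sortBySnapshot are hypothetical names, and a single float stands in for the queue usage fields. It first shows how one update between two comparisons makes a comparator on live values contradict itself, then shows one possible way to avoid that by copying the values once before sorting. It is an illustration under these assumptions, not a description of the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SnapshotSortSketch {

  /** Hypothetical stand-in for a queue whose usage is updated by another thread. */
  static final class DemoQueue {
    final String name;
    volatile float usedCapacity; // written by the committer, read by the sorter

    DemoQueue(String name, float usedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
    }
  }

  /** Immutable copy of the sort key, taken once per queue before sorting. */
  static final class QueueSnapshot {
    final DemoQueue queue;
    final float usedCapacity;

    QueueSnapshot(DemoQueue queue) {
      this.queue = queue;
      this.usedCapacity = queue.usedCapacity;
    }
  }

  /** Comparator that reads the live, mutable field on every call. */
  static final Comparator<DemoQueue> LIVE =
      (a, b) -> Float.compare(a.usedCapacity, b.usedCapacity);

  /** Sorts by frozen values, so later updates cannot change the order mid-sort. */
  static List<DemoQueue> sortBySnapshot(List<DemoQueue> queues) {
    List<QueueSnapshot> snapshots = new ArrayList<>();
    for (DemoQueue q : queues) {
      snapshots.add(new QueueSnapshot(q));
    }
    snapshots.sort((a, b) -> Float.compare(a.usedCapacity, b.usedCapacity));
    List<DemoQueue> sorted = new ArrayList<>();
    for (QueueSnapshot s : snapshots) {
      sorted.add(s.queue);
    }
    return sorted;
  }

  public static void main(String[] args) {
    DemoQueue a = new DemoQueue("a", 0.2f);
    DemoQueue b = new DemoQueue("b", 0.5f);

    int before = LIVE.compare(a, b); // a sorts before b
    a.usedCapacity = 0.8f;           // simulated concurrent commit
    int after = LIVE.compare(b, a);  // now b sorts before a
    // Both results are negative, which contradicts the sign-symmetry rule.
    System.out.println(before < 0 && after < 0); // prints true

    // Sorting on snapshots uses one consistent view of the values.
    System.out.println(sortBySnapshot(Arrays.asList(a, b)).get(0).name); // prints b
  }
}
```

The trade-off in this sketch is that the sorted order reflects the usage at snapshot time and may already be slightly stale when it is used.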
was:
Global Scheduler Async Thread crash stack
{code:java}
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeHi(TimSort.java:899)
    at java.util.TimSort.mergeAt(TimSort.java:516)
    at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
    at java.util.TimSort.sort(TimSort.java:254)
    at java.util.Arrays.sort(Arrays.java:1512)
    at java.util.ArrayList.sort(ArrayList.java:1462)
    at java.util.Collections.sort(Collections.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
{code}
JAVA 8 Arrays.sort default use timsort algo, and timsort has few require
{code:java}
1.x.compareTo(y) != y.compareTo(x)
2.x>y,y>z --> x > z
3.x=y, x.compareTo(z) == y.compareTo(z)
{code}
if not Arrays paramters not satify this require,TimSort will throw
'java.lang.IllegalArgumentException'
look at PriorityUtilizationQueueOrderingPolicy.compare function,we will know
Capacity Scheduler use this these queue resource usage to compare
{code:java}
AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity
{code}
In Capacity Scheduler Global Scheduler AsyncThread use
PriorityUtilizationQueueOrderingPolicy function to choose queue to assign
container,and construct a CSAssignment struct, and use
submitResourceCommitRequest function add CSAssignment to backlogs
ResourceCommitterService will tryCommit this CSAssignment,look tryCommit
function,there will update queue resource usage
{code:java}
public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
boolean updatePending) {
long commitStart = System.nanoTime();
ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
(ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
...
boolean isSuccess = false;
if (attemptId != null) {
FiCaSchedulerApp app = getApplicationAttempt(attemptId);
// Required sanity check for attemptId - when async-scheduling enabled,
// proposal might be outdated if AM failover just finished
// and proposal queue was not be consumed in time
if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
if (app.accept(cluster, request, updatePending)
&& app.apply(cluster, request, updatePending)) { // apply this resource
...
}
}
}
return isSuccess;
}
}
{code}
{code:java}
public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
FiCaSchedulerNode> request, boolean updatePending) {
...
if (!reReservation) {
getCSLeafQueue().apply(cluster, request);
}
...
}
{code}
LeafQueue.apply invok allocateResource
{code:java}
void allocateResource(Resource clusterResource,
Resource resource, String nodePartition) {
try {
writeLock.lock(); // only lock leaf queue lock
queueUsage.incUsed(nodePartition, resource);
++numContainers;
// there will update queue statistics
CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
this, labelManager, nodePartition);
} finally {
writeLock.unlock();
}
}
{code}
we found ResourceCommitterService will only lock leaf queue to update queue
statistics, but AsyncThread use sortAndGetChildrenAllocationIterator only lock
queue root queue lock
{code:java}
// ParentQueue.java
private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
String partition) {
try {
readLock.lock();
return queueOrderingPolicy.getAssignmentIterator(partition);
} finally {
readLock.unlock();
}
}
{code}
so if multi async thread compare queue usage statistics and
ResourceCommitterService apply leaf queue change statistics concurrent, will
break TimSort algo required, and cause thread crash
> Global Scheduler async thread crash caused by 'Comparison method violates its
> general contract'
> ----------------------------------------------------------------------------------------------
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.2.1
> Reporter: tuyu
> Assignee: Andras Gyori
> Priority: Major
> Fix For: 3.4.0, 2.10.2, 3.3.2, 3.2.4
>
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch,
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch,
> YARN-10178.006.patch, YARN-10178.branch-2.10.001.patch
>
>