[ 
https://issues.apache.org/jira/browse/YARN-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547096#comment-16547096
 ] 

Haibo Chen commented on YARN-8436:
----------------------------------

Thanks [~wilfreds] for the patch! I have some minor comments:

1) "TimSort which is used does not handle the mod of objects it sorts" => 
"TimSort which is used does not handle the concurrent modification of objects 
it is sorting"

2) Using sleep to synchronize two threads seems flaky. Do you think having the 
Comparator and the thread share a CountDownLatch is a better alternative? That 
is, when Comparator.compare() is called the latch is counted down, indicating 
that the sort has started, and the modification thread waits on the latch.
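
To make the suggestion concrete, here is a rough sketch of the latch idea. It is not taken from the patch: the class name LatchedSortSketch, the helper signalling() and the list of integers are hypothetical stand-ins for the test's real comparator and queue objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical names throughout; this is an illustration, not code from the patch.
class LatchedSortSketch {

  /** Wraps a comparator so that the latch is counted down on every compare() call. */
  static <T> Comparator<T> signalling(Comparator<T> delegate, CountDownLatch sortStarted) {
    return (a, b) -> {
      sortStarted.countDown(); // has no effect once the count has reached zero
      return delegate.compare(a, b);
    };
  }

  /** Returns true if the modifier thread observed that the sort had started. */
  static boolean run() {
    List<Integer> values = new ArrayList<>();
    for (int i = 100; i > 0; i--) {
      values.add(i);
    }

    CountDownLatch sortStarted = new CountDownLatch(1);
    boolean[] sawStart = new boolean[1];

    Thread modifier = new Thread(() -> {
      try {
        // Wait for the first compare() call instead of sleeping for a fixed time.
        sawStart[0] = sortStarted.await(10, TimeUnit.SECONDS);
        // A real test would modify the objects being sorted at this point.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    modifier.start();

    values.sort(signalling(Comparator.<Integer>naturalOrder(), sortStarted));

    try {
      modifier.join(); // join() makes the write to sawStart visible here
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return sawStart[0];
  }
}
```

The timeout on await() is only there so that a test cannot hang if the comparator is never invoked.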

> FSParentQueue: Comparison method violates its general contract
> --------------------------------------------------------------
>
>                 Key: YARN-8436
>                 URL: https://issues.apache.org/jira/browse/YARN-8436
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 3.1.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Minor
>         Attachments: YARN-8436.001.patch, YARN-8436.002.patch
>
>
> The ResourceManager can fail while sorting queues if an update comes in:
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>       at java.util.TimSort.mergeLo(TimSort.java:777)
>       at java.util.TimSort.mergeAt(TimSort.java:514)
> ...
>       at java.util.Collections.sort(Collections.java:175)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:223)
> {code}
> The reason it breaks is a change in the sorted object itself. 
> This is why it fails:
>  * an update from a node comes in as a heartbeat.
>  * the update triggers a check to see if we can assign a container on the 
> node.
>  * walk over the queue hierarchy to find a queue to assign a container to: 
> top down.
>  * for each parent queue we sort the child queues in {{assignContainer}} to 
> decide which queue to descend into.
>  * we lock the parent queue while sorting to prevent changes, but we do not 
> lock the child queues that we are sorting.
> If during this sorting a different node update changes a child queue then we 
> allow that. This means that the objects that we are trying to sort now might 
> be out of order. That causes the issue with the comparator. The comparator 
> itself is not broken.
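
The point that the comparator is consistent at any instant, yet can give contradictory answers when a sorted object changes between two calls, can be illustrated with a small single-threaded sketch. ChildQueue, its usage field and BY_USAGE are hypothetical stand-ins, not the FairScheduler classes, and the update is simulated with a plain assignment rather than a second thread.

```java
import java.util.Comparator;

// Hypothetical illustration; not the FairScheduler classes or their comparator.
class MutatingSortKeySketch {

  /** Stand-in for a child queue whose usage can change at any time. */
  static final class ChildQueue {
    volatile int usage;

    ChildQueue(int usage) {
      this.usage = usage;
    }
  }

  /** A comparator that is perfectly consistent for any fixed set of usage values. */
  static final Comparator<ChildQueue> BY_USAGE = Comparator.comparingInt(q -> q.usage);

  /** Returns the signs of two comparisons of the same pair, before and after an update. */
  static int[] compareAroundUpdate() {
    ChildQueue a = new ChildQueue(1);
    ChildQueue b = new ChildQueue(2);

    int before = BY_USAGE.compare(a, b); // a orders before b
    a.usage = 3;                         // simulated update arriving in the middle of a sort
    int after = BY_USAGE.compare(a, b);  // the same pair now orders the other way

    return new int[] {Integer.signum(before), Integer.signum(after)};
  }
}
```

A sort that sees both answers for the same pair during one run has no consistent ordering to work with, which is the situation that TimSort reports as a contract violation.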


