[
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201747#comment-15201747
]
Zephyr Guo commented on YARN-4743:
----------------------------------
I have been trying to solve this issue, but so far without success.
In my opinion, the issue is caused by concurrent operations on
{{FSAppAttempt}}. While {{FSLeafQueue}} is sorting the FSAppAttempts, the inner
{{Resource}} of an FSAppAttempt is modified. In that case,
{{FairShareComparator}} may not work correctly. Based on this idea, I wrote
YARN-4743-cdh5.4.7.patch (attached). The patch uses a snapshot to protect the
elements during sorting. Sadly, the patch does not resolve the problem: I got
the same exception while sorting, and the crashes became more frequent. I am
beginning to suspect that the comparator itself really has a problem. I
reviewed the {{FairShareComparator}} code and simulated all cases, but did not
find any bugs.
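To make the snapshot idea concrete, here is a minimal, self-contained sketch.
It is not the contents of YARN-4743-cdh5.4.7.patch: the {{App}} and
{{Snapshot}} classes and the {{sortBySnapshot}} method are hypothetical
stand-ins for {{FSAppAttempt}} and its {{Resource}}, and only a single memory
value is compared. The point is only that the comparator reads values copied
once before the sort, so later updates cannot change the ordering mid-sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SnapshotSort {
    /** Hypothetical stand-in for a schedulable whose usage may change at any time. */
    static final class App {
        final String name;
        volatile long usedMemory; // may be updated by another thread
        App(String name, long usedMemory) {
            this.name = name;
            this.usedMemory = usedMemory;
        }
    }

    /** Immutable copy of the one field the comparator reads. */
    static final class Snapshot {
        final App app;
        final long usedMemory;
        Snapshot(App app) {
            this.app = app;
            this.usedMemory = app.usedMemory; // read exactly once
        }
    }

    /** Sorts by the values captured before the sort starts, not by live values. */
    static List<App> sortBySnapshot(List<App> apps) {
        List<Snapshot> snapshots = new ArrayList<Snapshot>(apps.size());
        for (App a : apps) {
            snapshots.add(new Snapshot(a));
        }
        Collections.sort(snapshots, new Comparator<Snapshot>() {
            @Override
            public int compare(Snapshot x, Snapshot y) {
                return Long.compare(x.usedMemory, y.usedMemory);
            }
        });
        List<App> sorted = new ArrayList<App>(apps.size());
        for (Snapshot s : snapshots) {
            sorted.add(s.app);
        }
        return sorted;
    }
}
```

The trade-off in this sketch is one extra allocation per element on every
sort, in exchange for a comparator whose inputs cannot change while TimSort is
running.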
I need some ideas. I'd like to verify two things. 1) Can the inner Resource be
modified during sorting? Could someone review this for me? 2) Does the
comparator really have a mistake as well, or is my patch incorrect?
I suspect floating-point precision in the comparator, but it is hard to
reproduce: it has never reappeared in the test cluster, and it happens only
with low probability in a larger cluster.
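For question 2, one way to test a comparator independently of the scheduler is
a brute-force check of the {{Comparator}} contract over a set of sample
values. The sketch below is only an illustration under stated assumptions: the
class {{ComparatorContractCheck}} and the method {{findViolation}} are
hypothetical names, it does not use {{FairShareComparator}} itself, and it can
only find violations among the samples it is given.

```java
import java.util.Comparator;
import java.util.List;

public class ComparatorContractCheck {
    /**
     * Checks every pair and triple of samples against the Comparator contract.
     * Returns a description of the first violation found, or null if none.
     */
    static <T> String findViolation(List<T> samples, Comparator<? super T> cmp) {
        for (T a : samples) {
            for (T b : samples) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                // sgn(compare(a, b)) must equal -sgn(compare(b, a))
                if (ab != -ba) {
                    return "antisymmetry violated for " + a + ", " + b;
                }
                for (T c : samples) {
                    int bc = Integer.signum(cmp.compare(b, c));
                    int ac = Integer.signum(cmp.compare(a, c));
                    // a > b and b > c must imply a > c
                    if (ab > 0 && bc > 0 && ac <= 0) {
                        return "transitivity violated for " + a + ", " + b + ", " + c;
                    }
                    // compare(a, b) == 0 must imply a and b order the same against c
                    if (ab == 0 && ac != bc) {
                        return "consistency violated for " + a + ", " + b + ", " + c;
                    }
                }
            }
        }
        return null;
    }
}
```

Feeding such a check with values captured from a cluster where the crash
happens (rather than hand-picked ones) might help separate a comparator bug
from a concurrent-modification problem, since the check runs on fixed inputs.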
> ResourceManager crash because TimSort
> -------------------------------------
>
> Key: YARN-4743
> URL: https://issues.apache.org/jira/browse/YARN-4743
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.6.4
> Reporter: Zephyr Guo
> Assignee: Yufei Gu
>
> {code}
> 2016-02-26 14:08:50,821 FATAL
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:868)
> at java.util.TimSort.mergeAt(TimSort.java:485)
> at java.util.TimSort.mergeCollapse(TimSort.java:410)
> at java.util.TimSort.sort(TimSort.java:214)
> at java.util.TimSort.sort(TimSort.java:173)
> at java.util.Arrays.sort(Arrays.java:659)
> at java.util.Collections.sort(Collections.java:217)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
> at
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
> at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
> Comparator<Schedulable> comparator = policy.getComparator();
> writeLock.lock();
> try {
> Collections.sort(runnableApps, comparator);
> } finally {
> writeLock.unlock();
> }
> readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ......
> s1.getResourceUsage(), minShare1);
> boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
> s2.getResourceUsage(), minShare2);
> minShareRatio1 = (double) s1.getResourceUsage().getMemory()
> / Resources.max(RESOURCE_CALCULATOR, null, minShare1,
> ONE).getMemory();
> minShareRatio2 = (double) s2.getResourceUsage().getMemory()
> / Resources.max(RESOURCE_CALCULATOR, null, minShare2,
> ONE).getMemory();
> ......
> {code}
> {{getResourceUsage}} returns the current Resource, which is not stable: it
> can be modified while the sort is running.
> {code:title=FSAppAttempt.java}
> @Override
> public Resource getResourceUsage() {
> // Here the getPreemptedResources() always return zero, except in
> // a preemption round
> return Resources.subtract(getCurrentConsumption(),
> getPreemptedResources());
> }
> {code}
> {code:title=SchedulerApplicationAttempt}
> public Resource getCurrentConsumption() {
> return currentConsumption;
> }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ......
> Resources.addTo(currentConsumption, rmContainer.getContainer()
> .getResource());
> ......
> }
> {code}
> I suggest using a stable Resource in the comparator.
> Is there something wrong in my reasoning?