[ 
https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519522#comment-15519522
 ] 

Zephyr Guo commented on YARN-4743:
----------------------------------

The bug is caused by NaN.

I wrote a test case to verify {{FairShareComparator}} (see patch) and found 
that {{FairShareComparator}} cannot handle weight=0 correctly. 
To confirm that the weight really is 0, we dumped the collection that broke 
sorting on our cluster (see timsort.log). I expected the weight to always be 
greater than or equal to 1, but in practice it can be 0.

We get NaN when memorySize=0 and weight=0.
{code}
useToWeightRatio1 = s1.getResourceUsage().getMemorySize() /
  s1.getWeights().getWeight(ResourceType.MEMORY)
{code}
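
To illustrate why this NaN breaks the sort, here is a minimal, self-contained sketch. The class name, the helper, and the signum-based comparator are made up for illustration and are not the actual {{FairShareComparator}} code; the sketch only assumes that the usage is a long and the weight is a float, as in the expression above.

```java
import java.util.Comparator;

final class NanRatioDemo {
    // Same shape as the expression above: usage / weight.
    // The float weight forces floating-point division, so 0 / 0.0f is NaN
    // instead of throwing ArithmeticException.
    static double useToWeightRatio(long memorySize, float weight) {
        return memorySize / weight;
    }

    // Hypothetical comparator: the sign of the difference of two ratios.
    // Math.signum(NaN) is NaN, and casting NaN to int yields 0.
    static final Comparator<Double> BY_RATIO =
        (a, b) -> (int) Math.signum(a - b);

    public static void main(String[] args) {
        double nan = useToWeightRatio(0L, 0.0f);
        System.out.println("ratio is NaN: " + Double.isNaN(nan));
        // NaN compares as "equal" to both 1.0 and 2.0 ...
        System.out.println(BY_RATIO.compare(1.0, nan));
        System.out.println(BY_RATIO.compare(nan, 2.0));
        // ... but 1.0 and 2.0 are not equal to each other, so the
        // ordering is not transitive and TimSort may reject it.
        System.out.println(BY_RATIO.compare(1.0, 2.0));
    }
}
```

With such a comparator, an element whose ratio is NaN looks equal to every other element, which is exactly the kind of inconsistency that can make TimSort throw "Comparison method violates its general contract!".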

I'm not sure whether a weight of 0 is itself a bug. We can discuss that in 
another issue.
If weight = 0, the demanded memory must be 0 and 
{{yarn.scheduler.fair.sizebasedweight}} must be enabled.
Formula: weight = log2(1 + demand)
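
As a quick check of the formula, the following sketch evaluates it for a few demands. The class and method names are hypothetical, and it only models the formula as written here, not the scheduler's actual weight computation.

```java
final class SizeBasedWeightDemo {
    // weight = log2(1 + demand), computed via natural logarithms.
    // Math.log1p(0) is exactly 0.0, so a demand of 0 gives a weight of 0.
    static double sizeBasedWeight(long demandMemory) {
        return Math.log1p(demandMemory) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(sizeBasedWeight(0));    // zero demand, zero weight
        System.out.println(sizeBasedWeight(1));    // about 1
        System.out.println(sizeBasedWeight(1023)); // about 10
    }
}
```

Any positive integer demand gives a weight of at least 1, so under this formula 0 is the only weight below 1.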

It seems that a meaningful weight must be greater than or equal to 1, so the 
patch just sets the weight to 1. In any case, we need stricter code.
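
One possible shape for stricter code is sketched below. This is not the attached patch: the class, the helpers, and the choice to clamp with a minimum weight of 1 are assumptions made for illustration.

```java
final class GuardedRatioDemo {
    // Assumption: 1 is the smallest meaningful weight.
    static final float MIN_WEIGHT = 1.0f;

    // Clamp the weight before dividing so that 0 / 0 can never occur.
    // Math.max(NaN, x) is NaN, so a NaN weight is rejected explicitly.
    static double guardedRatio(long memorySize, float weight) {
        if (Float.isNaN(weight)) {
            throw new IllegalArgumentException("weight must not be NaN");
        }
        return memorySize / Math.max(weight, MIN_WEIGHT);
    }

    // Double.compare defines a total order, even if a NaN slips through.
    static int compareRatios(double ratio1, double ratio2) {
        return Double.compare(ratio1, ratio2);
    }

    public static void main(String[] args) {
        System.out.println(guardedRatio(0L, 0.0f));    // 0.0, not NaN
        System.out.println(guardedRatio(2048L, 2.0f)); // 1024.0
    }
}
```

Using {{Double.compare}} instead of the sign of a difference keeps the comparator consistent even if a NaN does reach it.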


BTW:
I think there are still concurrency problems (as the description says).
If you enable {{yarn.resourcemanager.work-preserving-recovery.enabled}}, the 
{{recoverContainer}} method is invoked in another thread. That method can 
modify {{attemptResourceUsage}}, which will break the sort if it happens while 
{{FSAppAttempt}} objects are being sorted.
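
One way to make the sort immune to such concurrent updates is to read each usage once and sort on those snapshots. The sketch below is only an illustration of that idea: {{App}}, {{Snapshot}}, and {{sortByUsageSnapshot}} are hypothetical stand-ins, not YARN classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class SnapshotSortDemo {
    // Hypothetical stand-in for an app attempt whose usage may be
    // modified by another thread at any time.
    static final class App {
        final String name;
        final AtomicLong usage;
        App(String name, long usage) {
            this.name = name;
            this.usage = new AtomicLong(usage);
        }
    }

    // Immutable pair of an app and the usage observed once, before sorting.
    static final class Snapshot {
        final App app;
        final long usage;
        Snapshot(App app, long usage) {
            this.app = app;
            this.usage = usage;
        }
    }

    // The comparator only ever sees the frozen values, so later updates to
    // App.usage cannot make two comparisons of the same pair disagree.
    static List<App> sortByUsageSnapshot(List<App> apps) {
        List<Snapshot> snapshots = new ArrayList<>(apps.size());
        for (App app : apps) {
            snapshots.add(new Snapshot(app, app.usage.get()));
        }
        snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.usage));
        List<App> sorted = new ArrayList<>(snapshots.size());
        for (Snapshot s : snapshots) {
            sorted.add(s.app);
        }
        return sorted;
    }
}
```

The resulting order may be slightly stale if a usage changes during the sort, but it is always a consistent order, so the sort itself cannot fail.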


> ResourceManager crash because TimSort
> -------------------------------------
>
>                 Key: YARN-4743
>                 URL: https://issues.apache.org/jira/browse/YARN-4743
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.4
>            Reporter: Zephyr Guo
>            Assignee: Yufei Gu
>             Fix For: 3.0.0-alpha1
>
>         Attachments: YARN-4743-v1.patch, YARN-CDH5.4.7.patch, timsort.log
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>          at java.util.TimSort.mergeHi(TimSort.java:868)
>          at java.util.TimSort.mergeAt(TimSort.java:485)
>          at java.util.TimSort.mergeCollapse(TimSort.java:410)
>          at java.util.TimSort.sort(TimSort.java:214)
>          at java.util.TimSort.sort(TimSort.java:173)
>          at java.util.Arrays.sort(Arrays.java:659)
>          at java.util.Collections.sort(Collections.java:217)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>          at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>          at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting 
> {{runnableApps}}.
> {code:title=FSLeafQueue.java}
>     Comparator<Schedulable> comparator = policy.getComparator();
>     writeLock.lock();
>     try {
>       Collections.sort(runnableApps, comparator);
>     } finally {
>       writeLock.unlock();
>     }
>     readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ......
>           s1.getResourceUsage(), minShare1);
>       boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>           s2.getResourceUsage(), minShare2);
>       minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare1, 
> ONE).getMemory();
>       minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare2, 
> ONE).getMemory();
> ......
> {code}
> {{getResourceUsage}} returns the current Resource, which is not stable: it 
> can change while the sort is running.
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
>     // Here the getPreemptedResources() always return zero, except in
>     // a preemption round
>     return Resources.subtract(getCurrentConsumption(), 
> getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
>     return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ......
>     Resources.addTo(currentConsumption, rmContainer.getContainer()
>       .getResource());
> ......
>   }
> {code}
> I suggest using a stable Resource in the comparator.
> Am I missing something?


