[ 
https://issues.apache.org/jira/browse/YARN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-2910:
----------------------------------------
    Attachment: YARN-2910.4.patch

I did not change the assignment :-(

Yes, the {{when(schedulable.getResourceUsage()).thenReturn(smallResource);}} 
should not have been in the patch, my mistake. I am not sure how it ended up 
there: I used it during development but not in the last tests.

On my machine the test failed just by adding applications. The issue seems to 
be in the initialisation of the application attempt. When I added debug output 
to the test run I could see that initialising the app attempt in the mock took 
a lot of time, which meant that {{getResourceUsage}} almost always ran 
over an empty list unless the number of iterations was raised above 1000. As 
soon as I moved the creation out of the thread, the failure occurred within 5 
iterations of the {{getResourceUsage}} call in the second thread, after adding 
fewer than 15 or so app instances.
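
For readers following along, here is a rough sketch of the test shape described 
above. It is not the code from the patch: {{QueueRaceSketch}} and 
{{readerSawCme}} are hypothetical names, and plain integers stand in for the app 
attempts. The point it illustrates is that the items are built before the 
threads start, so the writer thread does nothing but list adds and the reader 
overlaps with them.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class QueueRaceSketch {
    /**
     * Runs one writer thread that adds pre-built items to the list and one
     * reader thread that keeps iterating it until the writer is done.
     * Returns true if the reader hit a ConcurrentModificationException.
     */
    static boolean readerSawCme(List<Integer> apps, int appCount) {
        // Build the stand-in "apps" up front (assumption: in the real test
        // this is the slow mock initialisation), so the writer only adds.
        List<Integer> prepared = new ArrayList<>();
        for (int i = 0; i < appCount; i++) {
            prepared.add(i);
        }

        CountDownLatch start = new CountDownLatch(1);
        AtomicBoolean writerDone = new AtomicBoolean(false);
        AtomicBoolean sawCme = new AtomicBoolean(false);

        Thread writer = new Thread(() -> {
            try {
                start.await();
                for (Integer app : prepared) {
                    apps.add(app);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                writerDone.set(true);
            }
        });

        Thread reader = new Thread(() -> {
            try {
                start.await();
                while (!writerDone.get()) {
                    // Stand-in for getResourceUsage(): walk the whole list.
                    long total = 0;
                    for (Integer usage : apps) {
                        if (usage != null) {
                            total += usage;
                        }
                    }
                }
            } catch (ConcurrentModificationException e) {
                sawCme.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writer.start();
        reader.start();
        start.countDown(); // release both threads at the same moment
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for threads", e);
        }
        return sawCme.get();
    }

    public static void main(String[] args) {
        // The ArrayList result depends on timing; the copy-on-write list never throws.
        System.out.println("ArrayList saw CME: " + readerSawCme(new ArrayList<>(), 1000));
        System.out.println("CopyOnWriteArrayList saw CME: "
                + readerSawCme(new CopyOnWriteArrayList<>(), 1000));
    }
}
```

With an {{ArrayList}} the outcome of this sketch is timing dependent, so it only 
shows the structure of the test and says nothing about the failure rate.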

I have attached an updated patch which passes with the new code and has a 100% 
failure rate with the old code. This version of the test runs faster and is 
more reliable than the previous ones.

> FSLeafQueue can throw ConcurrentModificationException
> -----------------------------------------------------
>
>                 Key: YARN-2910
>                 URL: https://issues.apache.org/jira/browse/YARN-2910
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Ray Chiang
>         Attachments: FSLeafQueue_concurrent_exception.txt, 
> YARN-2910.004.patch, YARN-2910.1.patch, YARN-2910.2.patch, YARN-2910.3.patch, 
> YARN-2910.4.patch, YARN-2910.patch
>
>
> The lists that hold the runnable and the non-runnable apps are standard 
> ArrayLists, but there is no guarantee that they will only be manipulated by one 
> thread in the system. This can lead to the following exception:
> {noformat}
> 2014-11-12 02:29:01,169 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.util.ConcurrentModificationException: java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
> at java.util.ArrayList$Itr.next(ArrayList.java:831)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.getResourceUsage(FSLeafQueue.java:147)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getHeadroom(FSAppAttempt.java:180)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:923)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:516)
> {noformat}
> Full stack trace in the attached file.
> We should guard against that by using a thread-safe list such as 
> java.util.concurrent.CopyOnWriteArrayList.
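
To make the difference between the two list types concrete, the following is a 
small single-threaded sketch, not code from FSLeafQueue. {{FailFastSketch}} and 
{{throwsOnAddDuringIteration}} are hypothetical names, and the add inside the 
loop stands in for a second thread modifying the list. An {{ArrayList}} iterator 
is fail-fast and throws on the next step, while a {{CopyOnWriteArrayList}} 
iterator walks the snapshot taken when it was created.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class FailFastSketch {
    /**
     * Seeds the list with three items, then adds to it while iterating.
     * Returns true if the iteration fails with a
     * ConcurrentModificationException, false if it completes.
     */
    static boolean throwsOnAddDuringIteration(List<Integer> apps) {
        apps.add(1);
        apps.add(2);
        apps.add(3);
        try {
            for (Integer app : apps) {
                // Stands in for another thread adding an app mid-iteration.
                apps.add(app + 10);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsOnAddDuringIteration(new ArrayList<>()));            // prints true
        System.out.println(throwsOnAddDuringIteration(new CopyOnWriteArrayList<>())); // prints false
    }
}
```

The trade-off is that a {{CopyOnWriteArrayList}} copies its backing array on 
every write, so it fits lists that are read far more often than they are 
modified.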



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
