[ 
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739011#comment-13739011
 ] 

Zhijie Shen commented on YARN-292:
----------------------------------

Did more investigation on this issue:

{code}
2012-12-26 08:41:15,030 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Calling allocate on removed or non existant application appattempt_1356385141279_49525_000001
{code}
This log indicates that the ArrayIndexOutOfBoundsException happens because the 
application is not found. There are three possible reasons why the application 
is not found:

1. The application hasn't been added to FifoScheduler#applications. In that 
case, FifoScheduler will not send the APP_ACCEPTED event to the corresponding 
RMAppAttemptImpl. Without APP_ACCEPTED, RMAppAttemptImpl will not enter the 
SCHEDULED state, and consequently will not go through 
AMContainerAllocatedTransition to ALLOCATED_SAVING. Therefore, this case is 
impossible.

2. The application has already been removed from FifoScheduler#applications. To 
trigger the removal, the corresponding RMAppAttemptImpl needs to go through 
BaseFinalTransition. 

It is worth mentioning first that RMAppAttemptImpl's transitions are executed 
on the AsyncDispatcher thread, while YarnScheduler#handle is invoked on the 
SchedulerEventDispatcher thread. The two threads run in parallel, so the 
processing of an RMAppAttemptEvent and that of a SchedulerEvent may interleave. 
However, the processing of two RMAppAttemptEvents, or of two SchedulerEvents, 
will not.

Therefore, AMContainerAllocatedTransition cannot start until RMAppAttemptImpl 
has finished BaseFinalTransition. However, once RMAppAttemptImpl has gone 
through BaseFinalTransition, it is in a final state, so 
AMContainerAllocatedTransition will not happen at all. In conclusion, this case 
is impossible as well.

3. The application is in FifoScheduler#applications, but RMAppAttemptImpl 
fails to get it. First, FifoScheduler#applications is a TreeMap, which is not 
thread safe (FairScheduler#applications is a HashMap, while 
CapacityScheduler#applications is a ConcurrentHashMap). Second, the methods 
that access the map are not consistently synchronized, so reads and writes on 
the same map can run simultaneously. RMAppAttemptImpl, on the AsyncDispatcher 
thread, eventually calls FifoScheduler#applications#get in 
AMContainerAllocatedTransition, while FifoScheduler, on the 
SchedulerEventDispatcher thread, uses FifoScheduler#applications#add|remove. 
Therefore, get can return null for an application that actually exists when 
there is a large number of concurrent operations.
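
To illustrate the third case, here is a minimal sketch of one possible remedy: 
routing every read and write of the TreeMap through the same lock. It is not 
the actual FifoScheduler code and not necessarily what the patch will do. The 
class GuardedAppMap, its methods and the String values are hypothetical 
stand-ins for the scheduler's map and application objects.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the scheduler's application map; not the real FifoScheduler. */
final class GuardedAppMap {
    // TreeMap is not thread safe, so every read and write takes the same lock.
    private final Map<String, String> applications = new TreeMap<>();

    synchronized void add(String attemptId, String app) {
        applications.put(attemptId, app);
    }

    synchronized String remove(String attemptId) {
        return applications.remove(attemptId);
    }

    synchronized String get(String attemptId) {
        return applications.get(attemptId);
    }

    /**
     * Runs a writer thread (standing in for the scheduler's event thread) against
     * a reader on the calling thread (standing in for the attempt transition) and
     * returns how many lookups of an always-present key came back null.
     */
    static int countMissedLookups(int iterations) {
        GuardedAppMap map = new GuardedAppMap();
        map.add("appattempt_stable", "app");

        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                // Churn unrelated keys so the tree keeps rebalancing.
                String id = "appattempt_" + (i % 128);
                map.add(id, "app");
                map.remove(id);
            }
        });
        writer.start();

        int missed = 0;
        for (int i = 0; i < iterations; i++) {
            if (map.get("appattempt_stable") == null) {
                missed++;
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return missed;
    }

    public static void main(String[] args) {
        System.out.println("missed lookups: " + countMissedLookups(100_000));
    }
}
```

Because get, add and remove all hold the same monitor, a lookup can never 
observe the tree in the middle of a modification, so the key that is never 
removed is always found. Swapping in a concurrent map would be another option; 
the lock is used here only because it keeps the sketch closest to the existing 
TreeMap.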

Please feel free to correct me if you think there's something wrong or missing 
with the analysis. I'm going to work on a patch to fix the problem.
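
For readers following the stack trace in the issue description, the step from 
the "application not found" log to the exception can be sketched in isolation. 
The trace shows the exception being raised by Arrays$ArrayList.get inside 
AMContainerAllocatedTransition, which is consistent with calling get(0) on an 
empty list created by Arrays.asList. The sketch below only models that 
behaviour under this assumption; EmptyAllocationSketch, allocate and 
firstContainer are hypothetical names, not the ResourceManager's code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the failing path; names are illustrative only. */
final class EmptyAllocationSketch {
    // An empty fixed-size list backed by a zero-length array (java.util.Arrays$ArrayList).
    static final List<String> EMPTY_CONTAINER_LIST = Arrays.asList(new String[0]);

    // Models a scheduler that answers with an empty allocation for an unknown attempt.
    static List<String> allocate(Map<String, String> applications, String attemptId) {
        String app = applications.get(attemptId);
        if (app == null) {
            System.err.println(
                "Calling allocate on removed or non existant application " + attemptId);
            return EMPTY_CONTAINER_LIST;
        }
        return Arrays.asList("container_for_" + app);
    }

    // Models a caller that assumes at least one container was allocated.
    static String firstContainer(List<String> containers) {
        return containers.get(0);
    }

    public static void main(String[] args) {
        Map<String, String> noApplications = Collections.emptyMap();
        try {
            firstContainer(allocate(noApplications, "appattempt_unknown"));
        } catch (ArrayIndexOutOfBoundsException e) {
            // Same exception type as in the reported stack trace.
            System.out.println("caught " + e);
        }
    }
}
```

In this model, a failed lookup only produces an error log on the scheduler 
side, and the exception surfaces later in the caller that indexes into the 
empty result.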
                
> ResourceManager throws ArrayIndexOutOfBoundsException while handling 
> CONTAINER_ALLOCATED for application attempt
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-292
>                 URL: https://issues.apache.org/jira/browse/YARN-292
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Zhijie Shen
>
> {code:xml}
> 2012-12-26 08:41:15,030 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Calling allocate on removed or non existant application appattempt_1356385141279_49525_000001
> 2012-12-26 08:41:15,031 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_ALLOCATED for applicationAttempt application_1356385141279_49525
> java.lang.ArrayIndexOutOfBoundsException: 0
>       at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>       at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
>       at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
>       at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
>       at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>       at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>       at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>       at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
>       at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
>       at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
>       at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
>       at java.lang.Thread.run(Thread.java:662)
>  {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira