[
https://issues.apache.org/jira/browse/YARN-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739011#comment-13739011
]
Zhijie Shen commented on YARN-292:
----------------------------------
I investigated this issue further:
{code}
2012-12-26 08:41:15,030 ERROR
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler:
Calling allocate on removed or non existant application
appattempt_1356385141279_49525_000001
{code}
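As a side note on the exception type: the stack trace in the issue description ends in java.util.Arrays$ArrayList.get, i.e. a fixed-size list created by Arrays.asList. The sketch below is only an illustration under the assumption that the list of allocated containers was empty when its first element was read; emptyAllocation and firstElementThrows are hypothetical names, not YARN code.

```java
import java.util.Arrays;
import java.util.List;

public class EmptyAllocationDemo {
    // Hypothetical stand-in for an allocation that carries no containers
    // (assumption: this is what an unknown application would receive).
    static List<String> emptyAllocation() {
        return Arrays.asList();
    }

    // Returns true if reading the first element fails with
    // ArrayIndexOutOfBoundsException, as in the reported stack trace.
    static boolean firstElementThrows(List<String> containers) {
        try {
            containers.get(0);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstElementThrows(emptyAllocation()));
    }
}
```

A list returned by Arrays.asList is backed directly by the array, so get(0) on an empty one fails with an array index error rather than the IndexOutOfBoundsException that java.util.ArrayList would throw.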
This log indicates that the ArrayIndexOutOfBoundsException happens because the
application is not found. There are three possible ways for the application to
be missing:
1. The application hasn't been added to FifoScheduler#applications. If that
were the case, FifoScheduler would not send the APP_ACCEPTED event to the
corresponding RMAppAttemptImpl. Without the APP_ACCEPTED event,
RMAppAttemptImpl would not enter the SCHEDULED state and, consequently, would
not go through AMContainerAllocatedTransition to ALLOCATED_SAVING. Therefore,
this case is impossible.
2. The application has already been removed from FifoScheduler#applications. To
trigger the removal, the corresponding RMAppAttemptImpl needs to go through
BaseFinalTransition.
It is worth mentioning first that RMAppAttemptImpl's transitions are executed
on the AsyncDispatcher thread, while YarnScheduler#handle is invoked on the
SchedulerEventDispatcher thread. The two threads execute in parallel, so the
processing of an RMAppAttemptEvent and that of a SchedulerEvent may interleave.
However, the processing of two RMAppAttemptEvents, or of two SchedulerEvents,
will not.
Therefore, if the application had already been removed,
AMContainerAllocatedTransition could only start after RMAppAttemptImpl had
finished BaseFinalTransition. However, when RMAppAttemptImpl goes through
BaseFinalTransition, it enters a final state, so
AMContainerAllocatedTransition will not happen at all. In conclusion, this case
is impossible as well.
3. The application is in FifoScheduler#applications, but RMAppAttemptImpl
doesn't get it. First, FifoScheduler#applications is a TreeMap, which is not
thread-safe (FairScheduler#applications is a HashMap, while
CapacityScheduler#applications is a ConcurrentHashMap). Second, the methods
that access the map are not consistently synchronized, so reads and writes on
the same map can run simultaneously. RMAppAttemptImpl, on the AsyncDispatcher
thread, will eventually call FifoScheduler#applications#get in
AMContainerAllocatedTransition, while FifoScheduler, on the
SchedulerEventDispatcher thread, will use FifoScheduler#applications#add|remove.
Therefore, getting null when the application actually exists can happen under a
large number of concurrent operations.
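To make the third case concrete, here is a minimal sketch of one way such a lookup could be made safe against a concurrent writer. It is not the patch for this issue: ApplicationRegistry, its String keys and values, and the choice of ConcurrentSkipListMap (a sorted, thread-safe alternative to TreeMap) are all assumptions for illustration. Consistently synchronizing every method that touches the TreeMap would be another option.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class ApplicationRegistry {
    // Assumption: a sorted concurrent map replaces the unsynchronized TreeMap.
    private final ConcurrentNavigableMap<String, String> applications =
            new ConcurrentSkipListMap<>();

    public void add(String attemptId, String app) {
        applications.put(attemptId, app);
    }

    public void remove(String attemptId) {
        applications.remove(attemptId);
    }

    public String get(String attemptId) {
        return applications.get(attemptId);
    }

    // One thread adds and removes unrelated attempts (the scheduler side)
    // while the calling thread keeps looking up an attempt that is never
    // removed (the attempt side). Returns how many lookups came back null.
    static int countMissedLookups(int iterations) {
        ApplicationRegistry registry = new ApplicationRegistry();
        registry.add("attempt_stable", "app");

        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                String id = "attempt_" + i;
                registry.add(id, "app");
                registry.remove(id);
            }
        });
        writer.start();

        int missed = 0;
        for (int i = 0; i < iterations; i++) {
            if (registry.get("attempt_stable") == null) {
                missed++;
            }
        }

        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return missed;
    }

    public static void main(String[] args) {
        System.out.println("missed lookups: " + countMissedLookups(100000));
    }
}
```

With a thread-safe map, a lookup for an entry that is present and never removed cannot return null, no matter how the writer's operations interleave with it. A plain TreeMap gives no such guarantee while another thread is restructuring the tree.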
Please feel free to correct me if you think anything in the analysis is wrong
or missing. I'm going to work on a patch to fix the problem.
> ResourceManager throws ArrayIndexOutOfBoundsException while handling
> CONTAINER_ALLOCATED for application attempt
> ----------------------------------------------------------------------------------------------------------------
>
> Key: YARN-292
> URL: https://issues.apache.org/jira/browse/YARN-292
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.0.1-alpha
> Reporter: Devaraj K
> Assignee: Zhijie Shen
>
> {code:xml}
> 2012-12-26 08:41:15,030 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler:
> Calling allocate on removed or non existant application
> appattempt_1356385141279_49525_000001
> 2012-12-26 08:41:15,031 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> handling event type CONTAINER_ALLOCATED for applicationAttempt
> application_1356385141279_49525
> java.lang.ArrayIndexOutOfBoundsException: 0
> at java.util.Arrays$ArrayList.get(Arrays.java:3381)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
> at java.lang.Thread.run(Thread.java:662)
> {code}