[ https://issues.apache.org/jira/browse/YARN-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096250#comment-14096250 ]
Jian Fang commented on YARN-2416: --------------------------------- Here is the log for the events and state transitions. 2014-07-05 08:57:41,748 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from NEW to ALLOCATED 2014-07-05 08:57:41,760 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED 2014-07-05 08:57:41,833 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1404549222428_0001_01_000001 : $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx2048m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr 2014-07-05 08:57:42,737 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING 2014-07-05 08:58:54,290 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_01_000001 Container Transitioned from RUNNING to KILLED 2014-07-05 08:58:54,290 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1404549222428_0001_01_000001 in state: KILLED event:KILL 2014-07-05 08:58:54,394 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1404549222428_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING 2014-07-05 08:58:54,394 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1404549222428_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED 2014-07-05 08:58:54,395 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1404549222428_0001_000002 2014-07-05 08:58:54,397 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1404549222428_0001_02_000001 : $JAVA_HOME/bin/java -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Xmx2048m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr 2014-07-05 08:58:55,396 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING 2014-07-05 08:59:05,020 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1404549222428_0001_000002 2014-07-05 08:59:05,021 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.198.10.201 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1404549222428_0001 APPATTEMPTID=appattempt_1404549222428_0001_000002 2014-07-05 08:59:12,653 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1404549222428_0001_01_000001, NodeId: ip-10-198-22-164.us-west-1.compute.internal:9103, NodeHttpAddress: ip-10-198-22-164.us-west-1.compute.internal:9035, Resource: <memory:3378, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.198.22.164:9103 }, ] for AM appattempt_1404549222428_0001_000001 2014-07-05 08:59:12,653 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1404549222428_0001_02_000001, NodeId: ip-10-198-10-201.us-west-1.compute.internal:9103, NodeHttpAddress: ip-10-198-10-201.us-west-1.compute.internal:9035, Resource: <memory:3378, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.198.10.201:9103 }, ] for AM appattempt_1404549222428_0001_000002 > InvalidStateTransitonException in ResourceManager if AMLauncher does not > receive response for startContainers() call in time > ---------------------------------------------------------------------------------------------------------------------------- > > Key: YARN-2416 > URL: https://issues.apache.org/jira/browse/YARN-2416 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.4.0 > Reporter: Jian Fang > > AMLauncher calls startContainers(allRequests) to launch a container for > application master. Normally, the call comes back immediately so that the > RMAppAttempt changes its state from ALLOCATED to LAUNCHED. > However, we do observed that in some cases, the RPC call came back very late > but the AM container was already started. Because the RMAppAttempt stuck in > ALLOCATED state, once resource manager received the REGISTERED event from the > application master, it threw InvalidStateTransitonException as follows. > 2014-07-05 08:59:05,021 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > REGISTERED at ALLOCATED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > For subsequent STATUS_UPDATE and CONTAINER_ALLOCATED events for this job, > resource manager kept throwing InvalidStateTransitonException. > 2014-07-05 08:59:06,152 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > STATUS_UPDATE at ALLOCATED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > 2014-07-05 08:59:07,779 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1404549222428_0001_02_000002 Container Transitioned from NEW to > ALLOCATED > 2014-07-05 08:59:07,779 ERROR > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: > CONTAINER_ALLOCATED at ALLOCATED > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)