[ 
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313286#comment-14313286
 ] 

Jian He commented on YARN-933:
------------------------------

bq. You should not ignore RMAppAttemptEventType.LAUNCHED? We will have to 
explicitly kill the AppAttempt and the AM in this case
The AM is being killed here. The ALLOCATED state receives the KILL event, 
kills the AM (by sending the cleanup event to the AM launcher), and then 
moves to the FINAL_SAVING state.
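
To make the sequence above concrete, here is a minimal, self-contained sketch of it. This is not the RMAppAttemptImpl code: the class AttemptStateSketch, its two-value enums, and the "CLEANUP" string are hypothetical stand-ins, and only the two states discussed in this comment are modeled. It assumes that a late LAUNCHED event arriving at FINAL_SAVING can be dropped because the cleanup request has already been sent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the transition described in this comment.
// It does not use or mirror the real YARN state machine classes.
public class AttemptStateSketch {
    enum State { ALLOCATED, FINAL_SAVING }
    enum Event { KILL, LAUNCHED }

    private State state = State.ALLOCATED;

    // Stand-in for events sent to the AM launcher.
    final List<String> launcherRequests = new ArrayList<>();

    State getState() {
        return state;
    }

    void handle(Event event) {
        switch (state) {
            case ALLOCATED:
                if (event == Event.KILL) {
                    // Kill the AM by asking the launcher to clean it up,
                    // then move on to persist the final state.
                    launcherRequests.add("CLEANUP");
                    state = State.FINAL_SAVING;
                    return;
                }
                break; // other ALLOCATED transitions are not modeled here
            case FINAL_SAVING:
                if (event == Event.LAUNCHED) {
                    // Assumption: the AM is already being cleaned up,
                    // so a late LAUNCHED event is ignored.
                    return;
                }
                break;
        }
        throw new IllegalStateException(
            "Invalid event: " + event + " at " + state);
    }

    public static void main(String[] args) {
        AttemptStateSketch attempt = new AttemptStateSketch();
        attempt.handle(Event.KILL);     // ALLOCATED -> FINAL_SAVING, cleanup sent
        attempt.handle(Event.LAUNCHED); // late event, ignored
        System.out.println(attempt.getState() + " " + attempt.launcherRequests);
    }
}
```

In this sketch the cleanup request is recorded before the state changes, so nothing further needs to be done when LAUNCHED arrives late.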

> Potential InvalidStateTransitonException: Invalid event: LAUNCHED at 
> FINAL_SAVING
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-933
>                 URL: https://issues.apache.org/jira/browse/YARN-933
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>            Reporter: J.Andreina
>            Assignee: Rohith
>         Attachments: 0001-YARN-933.patch, 0001-YARN-933.patch, 
> YARN-933.3.patch, YARN-933.patch
>
>
> AM max retries is configured as 3 on both the client and RM side.
> Step 1: Install a cluster with NMs on 2 machines.
> Step 2: Make sure that ping from the RM machine to the NM1 machine succeeds 
> by IP but fails by hostname.
> Step 3: Execute a job.
> Step 4: After the AM [AppAttempt_1] is allocated to the NM1 machine, the 
> connection is lost.
> Observation :
> ==========
> After AppAttempt_1 has moved to the FAILED state, the release of the 
> container for AppAttempt_1 and the application removal are successful. A new 
> AppAttempt_2 is spawned.
> 1. Then a retry for AppAttempt_1 happens again.
> 2. The RM side tries to launch AppAttempt_1 again and hence fails with 
> InvalidStateTransitonException.
> 3. The client exited after AppAttempt_1 finished [but the job is actually 
> still running], while the configured number of app attempts is 3 and the 
> remaining app attempts are all spawned and running.
> RMLogs:
> ======
> 2013-07-17 16:22:51,013 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1373952096466_0056_000001 State change from SCHEDULED to ALLOCATED
> 2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s); 
> maxRetries=45
> 2013-07-17 16:36:07,091 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1373952096466_0056_01_000001 Timed out after 600 secs
> 2013-07-17 16:36:07,093 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1373952096466_0056_01_000001 Container Transitioned from ACQUIRED 
> to EXPIRED
> 2013-07-17 16:36:07,093 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Registering appattempt_1373952096466_0056_000002
> 2013-07-17 16:36:07,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application appattempt_1373952096466_0056_000001 is done. finalState=FAILED
> 2013-07-17 16:36:07,131 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Application removed - appId: application_1373952096466_0056 user: Rex 
> leaf-queue of parent: root #applications: 35
> 2013-07-17 16:36:07,132 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application Submission: appattempt_1373952096466_0056_000002, 
> 2013-07-17 16:36:07,138 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1373952096466_0056_000002 State change from SUBMITTED to SCHEDULED
> 2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s); 
> maxRetries=45
> 2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s); 
> maxRetries=45
> 2013-07-17 16:38:56,207 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error 
> launching appattempt_1373952096466_0056_000001. Got exception: 
> java.lang.reflect.UndeclaredThrowableException
> 2013-07-17 16:38:56,207 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> LAUNCH_FAILED at FAILED
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
>  at java.lang.Thread.run(Thread.java:662)
> Client Logs
> ========
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis 
> timeout while waiting for channel to be ready for connect. ch : 
> java.nio.channels.SocketChannel[connection-pending 
> remote=host-10-18-40-15/10.18.40.59:8020]
>  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:573)
>  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
> 2013-07-17 16:37:05,987 ERROR 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:Rex (auth:SIMPLE) cause:org.apache.hadoop.net.ConnectTimeoutException: 
> Call From HOST-10-18-91-55/10.18.40.57 to host-10-18-40-15:8020 failed on 
> socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 
> 20000 millis timeout while waiting for channel to be ready for connect. ch : 
> java.nio.channels.SocketChannel[connection-pending 
> remote=host-10-18-40-15/10.18.40.59:8020]; For more details see:  
> http://wiki.apache.org/hadoop/SocketTimeout



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
