[
https://issues.apache.org/jira/browse/YARN-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
J.Andreina updated YARN-933:
----------------------------
Description:
Hostname enabled.
AM max retries configured as 3 on both the client and the RM side.
Step 1: Install a cluster with NMs on 2 machines.
Step 2: Make ping by IP from the RM machine to the NM1 machine succeed, but
make ping by hostname fail.
Step 3: Execute a job.
Step 4: After the AM [ AppAttempt_1 ] is allocated to the NM1 machine, the
connection is lost.
Observation :
==========
After AppAttempt_1 moves to the FAILED state, the container for AppAttempt_1 is
released and the application is removed successfully. A new AppAttempt_2 is
spawned.
1. A retry for AppAttempt_1 then happens again.
2. The RM again tries to launch AppAttempt_1, which fails with
InvalidStateTransitonException.
3. The client exits after AppAttempt_1 finishes [ although the job is actually
still running ], even though 3 app attempts are configured and the remaining
app attempts are spawned and running.
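The ordering described above can be sketched as a tiny state machine. This is only an illustration, not the RMAppAttemptImpl code: the class name AttemptStateSketch and the EXPIRE event name are hypothetical, only two states are modeled, and IllegalStateException stands in for InvalidStateTransitonException. It shows why a late LAUNCH_FAILED is rejected once the attempt is already FAILED.

```java
import java.util.EnumMap;
import java.util.Map;

public class AttemptStateSketch {
    // ALLOCATED and FAILED appear in the RM log; no other states are modeled.
    public enum State { ALLOCATED, FAILED }

    // LAUNCH_FAILED appears in the RM log. EXPIRE is a hypothetical name for
    // whatever event fails the attempt when its container expires.
    public enum Event { EXPIRE, LAUNCH_FAILED }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<State, Map<Event, State>>(State.class);
    static {
        Map<Event, State> fromAllocated = new EnumMap<Event, State>(Event.class);
        fromAllocated.put(Event.EXPIRE, State.FAILED);
        fromAllocated.put(Event.LAUNCH_FAILED, State.FAILED);
        TRANSITIONS.put(State.ALLOCATED, fromAllocated);
        // FAILED is terminal in this sketch: it has no outgoing transitions.
        TRANSITIONS.put(State.FAILED, new EnumMap<Event, State>(Event.class));
    }

    private State current = State.ALLOCATED;

    // Applies the event, or throws if the current state has no transition for it.
    public State handle(Event event) {
        State next = TRANSITIONS.get(current).get(event);
        if (next == null) {
            throw new IllegalStateException(
                    "Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        AttemptStateSketch attempt = new AttemptStateSketch();
        attempt.handle(Event.EXPIRE);            // container expiry fails the attempt
        try {
            attempt.handle(Event.LAUNCH_FAILED); // launcher reports its failure late
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the second event is the problem: the launcher is still retrying for AppAttempt_1 after the attempt has already been failed by another path.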
RMLogs:
======
2013-07-17 16:22:51,013 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1373952096466_0056_000001 State change from SCHEDULED to ALLOCATED
2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s);
maxRetries=45
2013-07-17 16:36:07,091 INFO
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor:
Expired:container_1373952096466_0056_01_000001 Timed out after 600 secs
2013-07-17 16:36:07,093 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1373952096466_0056_01_000001 Container Transitioned from ACQUIRED to
EXPIRED
2013-07-17 16:36:07,093 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering appattempt_1373952096466_0056_000002
2013-07-17 16:36:07,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application appattempt_1373952096466_0056_000001 is done. finalState=FAILED
2013-07-17 16:36:07,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Application removed - appId: application_1373952096466_0056 user: Rex
leaf-queue of parent: root #applications: 35
2013-07-17 16:36:07,132 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application Submission: appattempt_1373952096466_0056_000002,
2013-07-17 16:36:07,138 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1373952096466_0056_000002 State change from SUBMITTED to SCHEDULED
2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s);
maxRetries=45
2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s);
maxRetries=45
2013-07-17 16:38:56,207 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error
launching appattempt_1373952096466_0056_000001. Got exception:
java.lang.reflect.UndeclaredThrowableException
2013-07-17 16:38:56,207 ERROR
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
LAUNCH_FAILED at FAILED
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:662)
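The timestamps in the RM log above allow a rough timing check. The sketch below uses only values copied from those log lines. It assumes the connect retries are evenly spaced and that maxRetries=45 bounds the launcher's connect attempts; the class and method names are hypothetical. The result suggests the retry window (about 945 secs) is longer than the 600 secs after which the container expired, so the launch failure is reported after the attempt is already FAILED.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RetryWindowSketch {
    // Timestamp layout used in the RM log lines, e.g. 2013-07-17 16:36:30,179
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    static long millisBetween(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, FMT),
                                LocalDateTime.parse(to, FMT)).toMillis();
    }

    // Average spacing between the "Already tried 38" and "Already tried 44" lines.
    static long perRetryMillis() {
        return millisBetween("2013-07-17 16:36:30,179",
                             "2013-07-17 16:38:36,203") / (44 - 38);
    }

    // Estimated total retry window, assuming 45 evenly spaced attempts.
    static long retryWindowMillis() {
        return perRetryMillis() * 45;
    }

    // Gap between "finalState=FAILED" for attempt 1 and the AMLauncher error.
    static long launchFailureLagMillis() {
        return millisBetween("2013-07-17 16:36:07,131",
                             "2013-07-17 16:38:56,207");
    }

    public static void main(String[] args) {
        System.out.println("per retry (ms):      " + perRetryMillis());
        System.out.println("retry window (ms):   " + retryWindowMillis());
        System.out.println("container expiry (ms): " + 600 * 1000L);
        System.out.println("launch failure lag (ms): " + launchFailureLagMillis());
    }
}
```

The per-retry spacing of roughly 21 secs is consistent with the 20000 millis connect timeout seen in the client log plus a short pause between attempts.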
Client Logs
========
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=host-10-18-40-15/10.18.40.59:8020]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:573)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
2013-07-17 16:37:05,987 ERROR org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:Rex (auth:SIMPLE)
cause:org.apache.hadoop.net.ConnectTimeoutException: Call From
HOST-10-18-91-55/10.18.40.57 to host-10-18-40-15:8020 failed on socket timeout
exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=host-10-18-40-15/10.18.40.59:8020]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
was:
Hostname enabled.
AM max retries configured as 3 on both the client and the RM side.
Step 1: Install a cluster in HA mode with NMs on 2 machines.
Step 2: Make ping by IP from the RM machine to the NM1 machine succeed, but
make ping by hostname fail.
Step 3: Execute a job.
Step 4: After the AM [ AppAttempt_1 ] is allocated to the NM1 machine, the
connection is lost.
Observation :
==========
After AppAttempt_1 moves to the FAILED state, the container for AppAttempt_1 is
released and the application is removed successfully. A new AppAttempt_2 is
spawned.
1. A retry for AppAttempt_1 then happens again.
2. The RM again tries to launch AppAttempt_1, which fails with
InvalidStateTransitonException.
3. The client exits after AppAttempt_1 finishes [ although the job is actually
still running ], even though 3 app attempts are configured and the remaining
app attempts are spawned and running.
RMLogs:
======
2013-07-17 16:22:51,013 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1373952096466_0056_000001 State change from SCHEDULED to ALLOCATED
2013-07-17 16:35:48,171 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: host-10-18-40-15/10.18.40.59:8048. Already tried 36 time(s);
maxRetries=45
2013-07-17 16:36:07,091 INFO
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor:
Expired:container_1373952096466_0056_01_000001 Timed out after 600 secs
2013-07-17 16:36:07,093 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1373952096466_0056_01_000001 Container Transitioned from ACQUIRED to
EXPIRED
2013-07-17 16:36:07,093 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering appattempt_1373952096466_0056_000002
2013-07-17 16:36:07,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application appattempt_1373952096466_0056_000001 is done. finalState=FAILED
2013-07-17 16:36:07,131 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Application removed - appId: application_1373952096466_0056 user: Rex
leaf-queue of parent: root #applications: 35
2013-07-17 16:36:07,132 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application Submission: appattempt_1373952096466_0056_000002,
2013-07-17 16:36:07,138 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1373952096466_0056_000002 State change from SUBMITTED to SCHEDULED
2013-07-17 16:36:30,179 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: host-10-18-40-15/10.18.40.59:8048. Already tried 38 time(s);
maxRetries=45
2013-07-17 16:38:36,203 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: host-10-18-40-15/10.18.40.59:8048. Already tried 44 time(s);
maxRetries=45
2013-07-17 16:38:56,207 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error
launching appattempt_1373952096466_0056_000001. Got exception:
java.lang.reflect.UndeclaredThrowableException
2013-07-17 16:38:56,207 ERROR
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event:
LAUNCH_FAILED at FAILED
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:630)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:495)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:476)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:662)
Client Logs
========
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=host-10-18-40-15/10.18.40.59:8020]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:573)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
2013-07-17 16:37:05,987 ERROR org.apache.hadoop.security.UserGroupInformation:
PriviledgedActionException as:Rex (auth:SIMPLE)
cause:org.apache.hadoop.net.ConnectTimeoutException: Call From
HOST-10-18-91-55/10.18.40.57 to host-10-18-40-15:8020 failed on socket timeout
exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=host-10-18-40-15/10.18.40.59:8020]; For more details see:
http://wiki.apache.org/hadoop/SocketTimeout
> After AppAttempt_1 failed [ removal and release of its container is done,
> AppAttempt_2 is scheduled ], relaunching AppAttempt_1 again throws an
> exception at the RM, and the client exited before the appattempt retries ran out
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-933
> URL: https://issues.apache.org/jira/browse/YARN-933
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.0.5-alpha
> Reporter: J.Andreina