[
https://issues.apache.org/jira/browse/YARN-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yingda Chen updated YARN-2433:
--
Description:
With Hadoop 2.4, container retention is supported across AM crash-and-restart.
However, after an AM is restarted with containers retained, it appears to use
a stale NMToken when starting new containers, which leads to the error below. To
truly support container retention, the AM should be able to communicate with
the retained container(s) using the old token and request new containers using
a new token.
This could be similar to YARN-1321, which was reported and fixed earlier.
ERROR:
Unauthorized request to start container. \nNMToken for application attempt :
appattempt_1408130608672_0065_01 was used for starting container with
container token issued for application attempt :
appattempt_1408130608672_0065_02
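To make the mismatch in the error above concrete, here is a minimal, self-contained sketch. It is modelled only on the wording of the error message, not on the actual ContainerManagerImpl code, and every class and method in it (NMTokenSketch, NMToken, ContainerToken, TokenCache, tryStart) is a hypothetical stand-in rather than a YARN API. The node address "nm-host:45454" is likewise made up. The sketch assumes the NM compares the application attempt recorded in the NMToken against the attempt recorded in the container token, and that the AM keeps one NMToken per node.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the check; these are NOT the real YARN classes. */
public class NMTokenSketch {

    /** Hypothetical stand-in for an NMToken: remembers which attempt it was issued to. */
    static final class NMToken {
        final String attemptId;
        NMToken(String attemptId) { this.attemptId = attemptId; }
    }

    /** Hypothetical stand-in for a container token issued by the RM. */
    static final class ContainerToken {
        final String attemptId;
        final int containerId;
        ContainerToken(String attemptId, int containerId) {
            this.attemptId = attemptId;
            this.containerId = containerId;
        }
    }

    /** Assumed NM-side check: both tokens must belong to the same application attempt. */
    static void authorizeStartRequest(NMToken nmToken, ContainerToken containerToken) {
        if (!nmToken.attemptId.equals(containerToken.attemptId)) {
            throw new SecurityException(
                "Unauthorized request to start container. NMToken for application attempt : "
                + nmToken.attemptId
                + " was used for starting container with container token issued for"
                + " application attempt : " + containerToken.attemptId);
        }
    }

    /** Assumed AM-side cache holding one NMToken per node address. */
    static final class TokenCache {
        private final Map<String, NMToken> byNode = new HashMap<>();
        void put(String nodeAddress, NMToken token) { byNode.put(nodeAddress, token); }
        NMToken get(String nodeAddress) { return byNode.get(nodeAddress); }
    }

    /** Returns true if the start request passes the check, false if it is rejected. */
    static boolean tryStart(TokenCache cache, String nodeAddress, ContainerToken containerToken) {
        try {
            authorizeStartRequest(cache.get(nodeAddress), containerToken);
            return true;
        } catch (SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String node = "nm-host:45454"; // hypothetical node address
        TokenCache cache = new TokenCache();

        // The restarted AM still holds the NMToken issued to attempt 01.
        cache.put(node, new NMToken("appattempt_1408130608672_0065_01"));
        ContainerToken newContainer =
            new ContainerToken("appattempt_1408130608672_0065_02", 2);
        System.out.println(tryStart(cache, node, newContainer)); // prints "false"

        // Once the cache holds a token for attempt 02, the same request passes.
        cache.put(node, new NMToken("appattempt_1408130608672_0065_02"));
        System.out.println(tryStart(cache, node, newContainer)); // prints "true"
    }
}
```

Under these assumptions, the request fails exactly when the cached NMToken predates the attempt that received the container token, which matches the attempt IDs (01 versus 02) reported in the error.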
STACK trace:
{code}
hadoop.ipc.ProtobufRpcEngine$Invoker.invoke
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103:
Response <- YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454:
startContainers {services_meta_data { key: "mapreduce_shuffle" value:
"\000\0004\372" } failed_requests { container_id { app_attempt_id {
application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2
} exception { message: "Unauthorized request to start container. \nNMToken for
application attempt : appattempt_1408130608672_0065_01 was used for
starting container with container token issued for application attempt :
appattempt_1408130608672_0065_02" trace:
"org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start
container. \nNMToken for application attempt :
appattempt_1408130608672_0065_01 was used for starting container with
container token issued for application attempt :
appattempt_1408130608672_0065_02\r\n\tat
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:48)\r\n\tat
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeStartRequest(ContainerManagerImpl.java:508)\r\n\tat
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainerInternal(ContainerManagerImpl.java:571)\r\n\tat
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:538)\r\n\tat
org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)\r\n\tat
org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)\r\n\tat
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)\r\n\tat
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)\r\n\tat
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)\r\n\tat
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)\r\n\tat
java.security.AccessController.doPrivileged(Native Method)\r\n\tat
javax.security.auth.Subject.doAs(Subject.java:415)\r\n\tat
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)\r\n\tat
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)\r\n" class_name:
"org.apache.hadoop.yarn.exceptions.YarnException" } }}
{code}
was:
With Hadoop 2.4, container retention is supported across AM crash-and-restart.
However, after an AM is restarted with containers retained, it appears to use
a stale NMToken when starting new containers, which leads to the error below. To
truly support container retention, the AM should be able to communicate with
the retained container(s) using the old token and request new containers using
a new token.
This could be similar to YARN-1321, which was reported and fixed earlier.
ERROR:
Unauthorized request to start container. \nNMToken for application attempt :
appattempt_1408130608672_0065_01 was used for starting container with
container token issued for application attempt :
appattempt_1408130608672_0065_02
STACK trace:
hadoop.ipc.ProtobufRpcEngine$Invoker.invoke
org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0 | 103:
Response <- YINGDAC1.redmond.corp.microsoft.com/10.121.136.231:45454:
startContainers {services_meta_data { key: "mapreduce_shuffle" value:
"\000\0004\372" } failed_requests { container_id { app_attempt_id {
application_id { id: 65 cluster_timestamp: 1408130608672 } attemptId: 2 } id: 2
} exception { message: "Unauthorized request to start container. \nNMToken for
application attempt : appattempt_1408130608672_0065_01 was used for
starting container with container token issued for application attempt :
appattempt_1408130608672_0065_02" trace:
"org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start
container. \nNMToken for application attempt :
appattemp