[
https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999094#comment-13999094
]
Jian He commented on YARN-2065:
-------------------------------
I looked at the exception posted in SLIDER-34. The problem is that the AM can
get new containers from the RM but cannot launch them on the NM, because of
the following method.
The token is generated with the previous container's attempt ID instead of the
current attempt ID, and the NM checks the attempt ID from the NMToken against
the attempt ID from the container.
{code}
public NMToken createAndGetNMToken(String applicationSubmitter,
    ApplicationAttemptId appAttemptId, Container container) {
  try {
    this.readLock.lock();
    HashSet<NodeId> nodeSet = this.appAttemptToNodeKeyMap.get(appAttemptId);
    NMToken nmToken = null;
    if (nodeSet != null) {
      if (!nodeSet.contains(container.getNodeId())) {
        LOG.info("Sending NMToken for nodeId : " + container.getNodeId()
            + " for container : " + container.getId());
        // Problem: the token is built from the attempt ID embedded in the
        // container ID, not from the appAttemptId passed to this method.
        Token token =
            createNMToken(container.getId().getApplicationAttemptId(),
                container.getNodeId(), applicationSubmitter);
        nmToken = NMToken.newInstance(container.getNodeId(), token);
        nodeSet.add(container.getNodeId());
      }
    }
    return nmToken;
  } finally {
    this.readLock.unlock();
  }
}
{code}
Changing this method will fix this problem.
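As a rough illustration of that change, here is a minimal self-contained
sketch. It assumes the fix is to build the token from the appAttemptId passed
in by the caller rather than from the attempt ID embedded in the container ID.
The class NMTokenFixSketch and its nested Container and NMToken types are
hypothetical stand-ins, not the real Hadoop classes, and plain ints and
Strings replace ApplicationAttemptId and NodeId.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-ins for the YARN types; not the Hadoop API.
class NMTokenFixSketch {

    /** Stand-in for a container: its node and the attempt ID inside its container ID. */
    static final class Container {
        final String nodeId;
        final int attemptIdFromContainerId;

        Container(String nodeId, int attemptIdFromContainerId) {
            this.nodeId = nodeId;
            this.attemptIdFromContainerId = attemptIdFromContainerId;
        }
    }

    /** Stand-in for an NMToken: the node it is valid for and the attempt ID baked into it. */
    static final class NMToken {
        final String nodeId;
        final int attemptId;

        NMToken(String nodeId, int attemptId) {
            this.nodeId = nodeId;
            this.attemptId = attemptId;
        }
    }

    // Attempt ID -> nodes that have already been sent a token for that attempt.
    private final Map<Integer, Set<String>> appAttemptToNodeKeyMap = new HashMap<>();

    synchronized void registerAttempt(int appAttemptId) {
        appAttemptToNodeKeyMap.put(appAttemptId, new HashSet<>());
    }

    /** Returns a new token for the container's node, or null if none is needed. */
    synchronized NMToken createAndGetNMToken(int appAttemptId, Container container) {
        Set<String> nodeSet = appAttemptToNodeKeyMap.get(appAttemptId);
        if (nodeSet == null || !nodeSet.add(container.nodeId)) {
            return null; // unknown attempt, or this node already has a token
        }
        // Assumed fix: use the caller's current attempt ID, not
        // container.attemptIdFromContainerId.
        return new NMToken(container.nodeId, appAttemptId);
    }

    public static void main(String[] args) {
        NMTokenFixSketch manager = new NMTokenFixSketch();
        manager.registerAttempt(2); // the restarted AM runs as attempt 2
        Container container = new Container("node1:45454", 1);
        NMToken token = manager.createAndGetNMToken(2, container);
        System.out.println("token attempt = " + token.attemptId);
    }
}
```

In this sketch the token carries attempt 2 even though the container ID was
issued under attempt 1, which is the behaviour the change is meant to produce.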
But another problem is that
ContainerManagerImpl#authorizeGetAndStopContainerRequest also requires the
previous NMToken to talk to the previous container and the current NMToken to
talk to the current container. Luckily, it currently does not throw an
exception but only logs error messages. We also need to change the NM side to
check against the applicationId rather than the attemptId.
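To make the difference between the two NM-side checks concrete, here is a
small sketch. It is not the ContainerManagerImpl code: the AttemptId type and
both method names are hypothetical, and it only contrasts an exact attempt
match with a match on the application ID alone.

```java
// Hypothetical illustration of the two authorization rules; not the NM code.
class NMTokenAuthSketch {

    /** Stand-in for an application attempt ID: application ID plus attempt number. */
    static final class AttemptId {
        final String applicationId;
        final int attemptNumber;

        AttemptId(String applicationId, int attemptNumber) {
            this.applicationId = applicationId;
            this.attemptNumber = attemptNumber;
        }
    }

    /** Strict rule: the token's attempt must equal the container's attempt. */
    static boolean authorizedByAttempt(AttemptId token, AttemptId container) {
        return token.applicationId.equals(container.applicationId)
            && token.attemptNumber == container.attemptNumber;
    }

    /** Relaxed rule: any attempt of the same application is accepted. */
    static boolean authorizedByApplication(AttemptId token, AttemptId container) {
        return token.applicationId.equals(container.applicationId);
    }

    public static void main(String[] args) {
        AttemptId currentToken = new AttemptId("application_1_0001", 2);
        AttemptId oldContainer = new AttemptId("application_1_0001", 1);
        System.out.println("by attempt: "
            + authorizedByAttempt(currentToken, oldContainer));
        System.out.println("by application: "
            + authorizedByApplication(currentToken, oldContainer));
    }
}
```

Under the relaxed rule, a token from the current attempt can be used to
address a container started by an earlier attempt of the same application,
while a token from a different application is still rejected.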
> AM cannot create new containers after restart-NM token from previous attempt
> used
> ---------------------------------------------------------------------------------
>
> Key: YARN-2065
> URL: https://issues.apache.org/jira/browse/YARN-2065
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.0
> Reporter: Steve Loughran
>
> Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot
> create new containers.
> The Slider minicluster test {{TestKilledAM}} can replicate this reliably: it
> kills the AM, then kills a container while the AM is down, which triggers a
> reallocation of a container, leading to this failure.