[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used

Jian He (JIRA) Fri, 16 May 2014 07:31:20 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999094#comment-13999094
 ]


Jian He commented on YARN-2065:
-------------------------------

I looked at the exception posted in SLIDER-34. The problem is that the AM can 
get new containers from the RM, but cannot launch them on the NM because of the 
following method.
The token is generated with the previous container's attempt ID instead of the 
current attempt ID, and the NM checks the attempt ID from the NMToken against 
the attempt ID from the container.
{code}
  public NMToken createAndGetNMToken(String applicationSubmitter,
      ApplicationAttemptId appAttemptId, Container container) {
    try {
      this.readLock.lock();
      HashSet<NodeId> nodeSet = this.appAttemptToNodeKeyMap.get(appAttemptId);
      NMToken nmToken = null;
      if (nodeSet != null) {
        if (!nodeSet.contains(container.getNodeId())) {
          LOG.info("Sending NMToken for nodeId : " + container.getNodeId()
              + " for container : " + container.getId());
          // Problem spot: the attempt ID is taken from the container,
          // not from the appAttemptId parameter.
          Token token =
              createNMToken(container.getId().getApplicationAttemptId(),
                container.getNodeId(), applicationSubmitter);
          nmToken = NMToken.newInstance(container.getNodeId(), token);
          nodeSet.add(container.getNodeId());
        }
      }
      return nmToken;
    } finally {
      this.readLock.unlock();
    }
  }
{code}
Changing this method will fix this problem. 
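
As a rough illustration of the kind of change meant here, the sketch below stamps the token with the attempt ID passed in by the caller rather than the one carried by the container. It is a minimal model, not the actual patch: the class NMTokenSketch, the records AttemptId, Container and NMToken, and the registerAttempt helper are hypothetical stand-ins for the real YARN types, and plain synchronization replaces the read lock.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-ins for the YARN types; not the real Hadoop classes.
final class NMTokenSketch {
    record AttemptId(String appId, int attempt) {}
    record Container(String containerId, AttemptId attemptInContainerId, String nodeId) {}
    record NMToken(String nodeId, AttemptId attemptId, String user) {}

    // Nodes for which each attempt has already been sent a token.
    private final Map<AttemptId, Set<String>> appAttemptToNodeKeyMap = new HashMap<>();

    // Hypothetical helper: start tracking an attempt.
    synchronized void registerAttempt(AttemptId attemptId) {
        appAttemptToNodeKeyMap.put(attemptId, new HashSet<>());
    }

    // Returns a token the first time the attempt sees the container's node,
    // otherwise null (same contract as the quoted method).
    synchronized NMToken createAndGetNMToken(String user, AttemptId current, Container c) {
        Set<String> nodeSet = appAttemptToNodeKeyMap.get(current);
        if (nodeSet == null || !nodeSet.add(c.nodeId())) {
            return null;
        }
        // Assumed fix: use the current attempt, not c.attemptInContainerId().
        return new NMToken(c.nodeId(), current, user);
    }

    public static void main(String[] args) {
        NMTokenSketch manager = new NMTokenSketch();
        AttemptId previous = new AttemptId("app_1", 1);
        AttemptId current = new AttemptId("app_1", 2);
        manager.registerAttempt(current);
        Container c = new Container("container_1", previous, "node-1:45454");
        System.out.println(manager.createAndGetNMToken("alice", current, c));
    }
}
```

In this model the token carries the restarted attempt's ID even when the container's own ID refers to the earlier attempt.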

But another problem is that 
ContainerManagerImpl#authorizeGetAndStopContainerRequest also requires the 
previous NMToken to talk to the previous container and the current NMToken to 
talk to the current container. Luckily, it currently does not throw an 
exception but only logs error messages. We also need to change the NM side to 
check against the applicationId rather than the attemptId. 
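
To make the difference between the two checks concrete, here is a small sketch comparing an attempt-level match with an application-level match. The class NMTokenAuthSketch, the AttemptId record and both method names are hypothetical and only model the comparison; they are not the NM's actual authorization code.

```java
// Hypothetical model of the two comparison strategies; not the real NM code.
final class NMTokenAuthSketch {
    record AttemptId(String appId, int attempt) {}

    // Strict check: the token and the container must belong to the same attempt.
    static boolean authorizedByAttempt(AttemptId tokenAttempt, AttemptId containerAttempt) {
        return tokenAttempt.equals(containerAttempt);
    }

    // Relaxed check: belonging to the same application is enough, so a token
    // from the current attempt can address a container from a previous attempt.
    static boolean authorizedByApplication(AttemptId tokenAttempt, AttemptId containerAttempt) {
        return tokenAttempt.appId().equals(containerAttempt.appId());
    }

    public static void main(String[] args) {
        AttemptId previous = new AttemptId("app_1", 1);
        AttemptId current = new AttemptId("app_1", 2);
        System.out.println("by attempt: " + authorizedByAttempt(current, previous));
        System.out.println("by application: " + authorizedByApplication(current, previous));
    }
}
```

Under the application-level check, a restarted AM would need only one NMToken per node to reach containers from either attempt, while tokens from other applications are still rejected.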

> AM cannot create new containers after restart-NM token from previous attempt 
> used
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-2065
>                 URL: https://issues.apache.org/jira/browse/YARN-2065
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Steve Loughran
>
> Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot 
> create new containers.
> The Slider minicluster test {{TestKilledAM}} can replicate this reliably: it 
> kills the AM, then kills a container while the AM is down, which triggers a 
> reallocation of a container, leading to this failure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
