[ 
https://issues.apache.org/jira/browse/YARN-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299003#comment-14299003
 ] 

Jack Chen commented on YARN-3112:
---------------------------------

I have found the cause of this error: the newly launched app attempt takes 
over the old containers from previous attempts, so the node set kept in 
NMTokenSecretManagerInRM.java is already filled with their nodes. When the new 
app attempt then fetches its allocated containers via 
pullNewlyAllocatedContainersAndNMTokens(), createAndGetNMToken() returns a 
null NMToken for every node that is already in the node set. As a result no 
valid NMToken reaches the ContainerLauncher, so the new container fails to 
launch. What I have done is clear the node set in 
pullNewlyAllocatedContainersAndNMTokens() before the container and NM tokens 
are created. Lines I added are marked with "+".

  public synchronized ContainersAndNMTokensAllocation
      pullNewlyAllocatedContainersAndNMTokens() {
    List<Container> returnContainerList =
        new ArrayList<Container>(newlyAllocatedContainers.size());
    List<NMToken> nmTokens = new ArrayList<NMToken>();
+   // clear the node set for NMTokens
+   rmContext.getNMTokenSecretManager()
+       .clearNodeSetForAttempt(getApplicationAttemptId());
    for (Iterator<RMContainer> i = newlyAllocatedContainers.iterator(); i
      .hasNext();) {
      RMContainer rmContainer = i.next();
      Container container = rmContainer.getContainer();
      try {
        // create container token and NMToken altogether.
        container.setContainerToken(rmContext.getContainerTokenSecretManager()
          .createContainerToken(container.getId(), container.getNodeId(),
            getUser(), container.getResource(), container.getPriority(),
            rmContainer.getCreationTime(), this.logAggregationContext));
        NMToken nmToken =
            rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
              getApplicationAttemptId(), container);
+       // check whether nmToken is null
+       LOG.info("[hchen]NMToken for container " + container.getId()
+           + " NMToken:" + nmToken);
        if (nmToken != null) {
          nmTokens.add(nmToken);
        }
      } catch (IllegalArgumentException e) {
        // DNS might be down, skip returning this container.
        LOG.error("Error trying to assign container token and NM token to" +
            " an allocated container " + container.getId(), e);
        continue;
      }
      returnContainerList.add(container);
      i.remove();
      rmContainer.handle(new RMContainerEvent(rmContainer.getContainerId(),
        RMContainerEventType.ACQUIRED));
    }
    return new ContainersAndNMTokensAllocation(returnContainerList, nmTokens);
  }
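
As a rough illustration of the behaviour described above (not the actual 
Hadoop code), the following self-contained Java sketch models the per-attempt 
node set. The class NodeSetSketch, the plain-string ids and the markNodeSeen() 
helper are hypothetical stand-ins; only the method names createAndGetNMToken 
and clearNodeSetForAttempt mirror the calls used in the patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the per-attempt node set; this is NOT the
// real NMTokenSecretManagerInRM. Attempt and node ids are plain strings here.
class NodeSetSketch {
    // attempt id -> nodes that are already considered to hold an NMToken
    private final Map<String, Set<String>> nodesPerAttempt = new HashMap<>();

    // Assumed stand-in for the effect of transferring old containers
    // to a new attempt: their nodes end up in the node set.
    void markNodeSeen(String attemptId, String nodeId) {
        nodesPerAttempt.computeIfAbsent(attemptId, k -> new HashSet<>()).add(nodeId);
    }

    // Returns a token only the first time a node is seen for the attempt,
    // and null when the node is already in the set.
    String createAndGetNMToken(String attemptId, String nodeId) {
        Set<String> nodes =
            nodesPerAttempt.computeIfAbsent(attemptId, k -> new HashSet<>());
        return nodes.add(nodeId) ? "token-for-" + nodeId : null;
    }

    // Empties the node set so that tokens are issued again.
    void clearNodeSetForAttempt(String attemptId) {
        Set<String> nodes = nodesPerAttempt.get(attemptId);
        if (nodes != null) {
            nodes.clear();
        }
    }

    public static void main(String[] args) {
        NodeSetSketch manager = new NodeSetSketch();
        String attempt = "appattempt_02";   // illustrative id
        String node = "ixk02:47625";        // node from the error log

        // The new attempt inherits a container that runs on this node.
        manager.markNodeSeen(attempt, node);

        // A new container on the same node gets no token.
        System.out.println("before clear: " + manager.createAndGetNMToken(attempt, node));

        // After clearing the node set, a token is issued again.
        manager.clearNodeSetForAttempt(attempt);
        System.out.println("after clear: " + manager.createAndGetNMToken(attempt, node));
    }
}
```

In this model the first call prints "before clear: null" and the second 
prints "after clear: token-for-ixk02:47625", which is the effect the added 
clearNodeSetForAttempt() call is meant to have.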

> AM restart and keep containers from previous attempts, then new container 
> launch failed
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-3112
>                 URL: https://issues.apache.org/jira/browse/YARN-3112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: in real linux cluster
>            Reporter: Jack Chen
>
> This error is very similar to YARN-1795 and YARN-1839, but I have checked the 
> fixes for those JIRAs and the patches are already included in my version. I 
> think this error is caused by the NMTokens differing between the old and new 
> app attempts. The new AM inherits the old tokens from the previous AM because 
> of my configuration (keepContainers=true), so the tokens for new containers 
> are replaced by the old ones in the NMTokenCache.
> 2015-01-29 10:04:49,603 ERROR [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1422546145900_0001_02_000002 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for ixk02:47625
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:256)
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:246)
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:132)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:401)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:367)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
