[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757299#comment-17757299 ]
ASF GitHub Bot commented on YARN-8980:
--------------------------------------

zhengchenyu opened a new pull request, #5975:
URL: https://github.com/apache/hadoop/pull/5975

### Description of PR

To avoid repeatedly passing the same NMToken to an application, the ResourceManager uses NMTokenSecretManagerInRM, whose appAttemptToNodeKeyMap records, per AppAttempt, the nodes for which an NMToken has already been issued. A UAM has only one AppAttempt, so after the UAM restarts, the NMTokens it previously received are lost. However, because NMTokenSecretManagerInRM::appAttemptToNodeKeyMap is not cleared, the ResourceManager does not resend the NMTokens it has already issued, and the AM reports that the NMToken is missing. The specific error is as follows:

```
No NMToken sent for XX_HOST:XX_PORT
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:262)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:252)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:137)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:433)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:146)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:394)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

### How was this patch tested?

Unit tests and testing on a real cluster.

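The bookkeeping described above can be illustrated with a minimal sketch. This is not the Hadoop implementation: the class `NMTokenTrackerSketch` and its methods `shouldSendToken` and `clearNodeSet` are hypothetical names, and attempt and node ids are reduced to plain strings. It only shows why a stale per-attempt entry suppresses resending a token after the UAM has lost it, and why clearing the entry on re-registration restores delivery.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the RM-side bookkeeping (not Hadoop code).
class NMTokenTrackerSketch {
  // attempt id -> nodes for which a token has already been sent to that attempt
  private final Map<String, Set<String>> appAttemptToNodes = new HashMap<>();

  // Returns true if a token still has to be sent for (attempt, node) and
  // records the node; returns false if one was already sent.
  boolean shouldSendToken(String attemptId, String nodeId) {
    return appAttemptToNodes
        .computeIfAbsent(attemptId, k -> new HashSet<>())
        .add(nodeId);
  }

  // Forgets every node recorded for the attempt, e.g. when the AM re-registers.
  void clearNodeSet(String attemptId) {
    Set<String> nodes = appAttemptToNodes.get(attemptId);
    if (nodes != null) {
      nodes.clear();
    }
  }

  public static void main(String[] args) {
    NMTokenTrackerSketch rm = new NMTokenTrackerSketch();
    String attempt = "attempt_1";   // a UAM keeps the same attempt across restarts
    String node = "XX_HOST:XX_PORT";

    System.out.println(rm.shouldSendToken(attempt, node)); // true: first allocation
    System.out.println(rm.shouldSendToken(attempt, node)); // false: already sent

    // The UAM restarts and loses its tokens, but the entry is stale:
    System.out.println(rm.shouldSendToken(attempt, node)); // false: token never resent

    // Clearing on re-registration makes the RM send the token again:
    rm.clearNodeSet(attempt);
    System.out.println(rm.shouldSendToken(attempt, node)); // true
  }
}
```

In this sketch the third call corresponds to the "No NMToken sent" failure: the tracker believes the token was delivered, while the restarted AM no longer holds it.
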
### For code changes:

Currently, when a UAM re-registers, appAttemptToNodeKeyMap is cleared only when there are transferredContainers. This patch simply moves the clearing code earlier, so that it runs regardless of whether there are transferredContainers.

> Mapreduce application container start fail after AM restart.
> -------------------------------------------------------------
>
>                 Key: YARN-8980
>                 URL: https://issues.apache.org/jira/browse/YARN-8980
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bibin Chundatt
>            Assignee: Shilun Fan
>            Priority: Major
>
> UAMs to subclusters are always launched with keepContainers.
> In AM restart scenarios, the UAM registers again with the RM and receives the
> running containers with NMTokens. The NMToken received by the UAM in
> getPreviousAttemptContainersNMToken is never used by the MapReduce application.
> The Federation Interceptor should take care of such scenarios too: merge the
> NMToken received at registration into the allocate response.
> A container allocation response on the same node will have an empty NMToken.
> issue credits : [~Nallasivan]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org