[
https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757299#comment-17757299
]
ASF GitHub Bot commented on YARN-8980:
--------------------------------------
zhengchenyu opened a new pull request, #5975:
URL: https://github.com/apache/hadoop/pull/5975
### Description of PR
To avoid repeatedly passing the same NMToken to an Application, the
ResourceManager uses NMTokenSecretManagerInRM, whose appAttemptToNodeKeyMap
records, per AppAttempt, the Nodes for which a Token has already been issued.
A UAM has only one AppAttempt. After the UAM restarts, the NMTokens it
previously held are lost. However, because
NMTokenSecretManagerInRM::appAttemptToNodeKeyMap is not cleared, the
ResourceManager will not resend NMTokens that it has already issued, so the
client reports that the NMToken is missing. The specific error is as follows:
```
No NMToken sent for XX_HOST:XX_PORT
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:262)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:252)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:137)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:433)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:146)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:394)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
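The failure mode can be reproduced in a toy model. The following is a minimal sketch, not the real NMTokenSecretManagerInRM: only the name appAttemptToNodeKeyMap comes from the description above, while the class, the `shouldSendToken` method, and the attempt and node ids are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of per-attempt NMToken bookkeeping.
// It does not reproduce the actual ResourceManager code.
class NMTokenBookkeepingSketch {
  // attempt id -> nodes for which a token has already been issued
  private final Map<String, Set<String>> appAttemptToNodeKeyMap = new HashMap<>();

  // Returns true if a token should accompany this allocation, i.e. the
  // node has not yet been recorded for this attempt.
  boolean shouldSendToken(String attemptId, String nodeId) {
    return appAttemptToNodeKeyMap
        .computeIfAbsent(attemptId, k -> new HashSet<>())
        .add(nodeId);
  }

  public static void main(String[] args) {
    NMTokenBookkeepingSketch rm = new NMTokenBookkeepingSketch();
    Set<String> clientTokens = new HashSet<>(); // tokens held in the UAM process

    if (rm.shouldSendToken("attempt_1", "host1:8041")) {
      clientTokens.add("host1:8041");
    }
    clientTokens.clear(); // UAM restarts: its in-memory tokens are gone

    // Same attempt id, same node: the RM-side entry is still present,
    // so no token is resent and the client has nothing to connect with.
    boolean resent = rm.shouldSendToken("attempt_1", "host1:8041");
    System.out.println("token resent after restart: " + resent);
    System.out.println("client holds token: " + clientTokens.contains("host1:8041"));
  }
}
```

Because the attempt id does not change across a UAM restart, the stale entry suppresses the resend, which corresponds to the "No NMToken sent" error above.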
### How was this patch tested?
Unit tests and testing on a real cluster.
### For code changes:
Currently, when a UAM re-registers, appAttemptToNodeKeyMap is cleared only if
there are transferredContainers. This patch moves the clearing code earlier,
ahead of that check, so that the map is cleared regardless.
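The effect of moving the clear can be sketched as follows. This is an illustrative model under stated assumptions, not the code in the pull request: the class `ReRegisterSketch` and the methods `reRegisterBefore`, `reRegisterAfter`, `clearNodeSet`, and `shouldSendToken` are hypothetical names, and containers are represented as plain strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model contrasting the old and new re-registration order.
class ReRegisterSketch {
  // attempt id -> nodes for which a token has already been issued
  private final Map<String, Set<String>> appAttemptToNodeKeyMap = new HashMap<>();

  // True if the node is new for this attempt and a token should be sent.
  boolean shouldSendToken(String attemptId, String nodeId) {
    return appAttemptToNodeKeyMap
        .computeIfAbsent(attemptId, k -> new HashSet<>())
        .add(nodeId);
  }

  // Forget which nodes were already served for this attempt.
  private void clearNodeSet(String attemptId) {
    Set<String> nodes = appAttemptToNodeKeyMap.get(attemptId);
    if (nodes != null) {
      nodes.clear();
    }
  }

  // Old order: the clear happens only inside the transferred-containers branch.
  void reRegisterBefore(String attemptId, List<String> transferredContainers) {
    if (!transferredContainers.isEmpty()) {
      clearNodeSet(attemptId);
      // ... return the transferred containers and their tokens ...
    }
  }

  // New order: the clear is moved ahead of the branch and always runs.
  void reRegisterAfter(String attemptId, List<String> transferredContainers) {
    clearNodeSet(attemptId);
    if (!transferredContainers.isEmpty()) {
      // ... return the transferred containers and their tokens ...
    }
  }
}
```

With the old order, a re-registration that carries no transferred containers leaves the stale node set in place, so a later allocation on the same node gets no token. With the new order, `shouldSendToken` returns true again for that node after re-registration.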
> Mapreduce application container start fail after AM restart.
> -------------------------------------------------------------
>
> Key: YARN-8980
> URL: https://issues.apache.org/jira/browse/YARN-8980
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: Shilun Fan
> Priority: Major
>
> UAMs to subclusters are always launched with keepContainers.
> In AM restart scenarios, the UAM registers again with the RM and receives the
> running containers along with their NMTokens. The NMTokens received by the UAM in
> getPreviousAttemptContainersNMToken are never used by the mapreduce application.
> The Federation Interceptor should handle such scenarios too: merge the NMTokens
> received at registration into the allocate response.
> A container allocation response for the same node will have an empty NMToken.
> issue credits : [~Nallasivan]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)