[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.

ASF GitHub Bot (Jira) Tue, 22 Aug 2023 02:13:43 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757299#comment-17757299
 ]


ASF GitHub Bot commented on YARN-8980:
--------------------------------------

zhengchenyu opened a new pull request, #5975:
URL: https://github.com/apache/hadoop/pull/5975

   
   ### Description of PR
   
   To avoid sending the same NMToken to an application repeatedly, the
   ResourceManager uses NMTokenSecretManagerInRM, whose appAttemptToNodeKeyMap
   records, per AppAttempt, the nodes for which an NMToken has already been
   issued.
   A UAM has only one AppAttempt, so after the UAM restarts, the NMTokens it
   previously held are lost. However, because
   NMTokenSecretManagerInRM::appAttemptToNodeKeyMap is not cleared, the
   ResourceManager does not resend the NMTokens it has already issued, and the
   UAM reports that the NMToken is missing. The specific error is as follows:
   
   ```
   No NMToken sent for XX_HOST:XX_PORT
   at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:262)
   at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:252)
   at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:137)
   at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:433)
   at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:146)
   at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:394)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   ```
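
   The bookkeeping described above can be illustrated with a toy model. This
   is only a sketch, not the real NMTokenSecretManagerInRM: the class name,
   the method names, and the use of plain strings for attempt and node IDs are
   assumptions made for illustration. Only the map name
   appAttemptToNodeKeyMap comes from the description.

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;

   // Toy model only; names other than appAttemptToNodeKeyMap are hypothetical.
   public class NMTokenBookkeepingSketch {
       // attempt ID -> nodes for which a token has already been issued
       private final Map<String, Set<String>> appAttemptToNodeKeyMap = new HashMap<>();

       /** Returns true if a token would be sent for this node, false if it is suppressed. */
       public boolean maybeSendToken(String attemptId, String nodeId) {
           Set<String> nodes =
               appAttemptToNodeKeyMap.computeIfAbsent(attemptId, k -> new HashSet<>());
           // Set.add returns false when the node is already recorded.
           return nodes.add(nodeId);
       }

       /** Forgets which nodes were already served for the given attempt. */
       public void clearAttempt(String attemptId) {
           Set<String> nodes = appAttemptToNodeKeyMap.get(attemptId);
           if (nodes != null) {
               nodes.clear();
           }
       }

       public static void main(String[] args) {
           NMTokenBookkeepingSketch rm = new NMTokenBookkeepingSketch();
           String attempt = "attempt_1";
           String node = "XX_HOST:XX_PORT";

           System.out.println("first allocation sends token: "
               + rm.maybeSendToken(attempt, node));
           // The UAM restarts and loses its token, but the attempt ID is unchanged,
           // so without clearing the entry the token is not sent again.
           System.out.println("after restart, no clear: "
               + rm.maybeSendToken(attempt, node));
           rm.clearAttempt(attempt);
           System.out.println("after restart, with clear: "
               + rm.maybeSendToken(attempt, node));
       }
   }
   ```

   In this model the second call is suppressed because the same attempt ID is
   reused, which mirrors the "No NMToken sent" failure on the client side.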
   
   ### How was this patch tested?
   
   Unit tests and testing on a real cluster.
   
   
   ### For code changes:
   
   Currently, when a UAM re-registers, appAttemptToNodeKeyMap is cleared only
   when there are transferredContainers. This patch moves the clearing code
   earlier, ahead of that check.
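
   The change in ordering can be sketched as follows. This is a simplified
   illustration, not the actual patch: the class, the method names, and the
   representation of nodes and containers as strings are all hypothetical.

   ```java
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Toy model of the re-registration ordering; not the actual Hadoop code.
   public class ReRegisterOrderingSketch {

       // Before: the per-attempt node set is cleared only if containers are transferred.
       public static void reRegisterBefore(Set<String> sentNodes,
                                           List<String> transferredContainers) {
           if (!transferredContainers.isEmpty()) {
               sentNodes.clear();
           }
       }

       // After: the clear is moved ahead of the transferred-container check.
       public static void reRegisterAfter(Set<String> sentNodes,
                                          List<String> transferredContainers) {
           sentNodes.clear();
           if (!transferredContainers.isEmpty()) {
               // Handling of transferred containers would go here.
           }
       }

       public static void main(String[] args) {
           List<String> noContainers = Collections.emptyList();

           Set<String> before = new HashSet<>(Arrays.asList("XX_HOST:XX_PORT"));
           reRegisterBefore(before, noContainers);
           System.out.println("old ordering, nodes still marked as sent: " + before);

           Set<String> after = new HashSet<>(Arrays.asList("XX_HOST:XX_PORT"));
           reRegisterAfter(after, noContainers);
           System.out.println("new ordering, nodes still marked as sent: " + after);
       }
   }
   ```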
   




> Mapreduce application container start  fail after AM restart.
> -------------------------------------------------------------
>
>                 Key: YARN-8980
>                 URL: https://issues.apache.org/jira/browse/YARN-8980
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bibin Chundatt
>            Assignee: Shilun Fan
>            Priority: Major
>
> UAMs to subclusters are always launched with keepContainers.
> In AM restart scenarios, the UAM registers again with the RM and receives
> the running containers along with their NMTokens. The NMTokens received by
> the UAM in getPreviousAttemptContainersNMToken are never used by the
> MapReduce application.
> The Federation Interceptor should handle such scenarios too: merge the
> NMTokens received at registration into the allocate response.
> A container allocation response for the same node will have an empty NMToken.
> Issue credits: [~Nallasivan]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
