[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763298#comment-17763298 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 commented on PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1712381663

@zhengchenyu Thanks for your contribution! Merged into trunk.

> Mapreduce application container start fail after AM restart.
> -
>
> Key: YARN-8980
> URL: https://issues.apache.org/jira/browse/YARN-8980
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: zhengchenyu
> Priority: Major
> Labels: pull-request-available
>
> UAMs to subclusters are always launched with keepContainers. In AM restart
> scenarios, the UAM registers again with the RM and receives its running
> containers along with NMTokens. However, the NMTokens the UAM receives via
> getPreviousAttemptContainersNMToken are never used by the MapReduce
> application. The Federation Interceptor should handle this scenario too:
> merge the NMTokens received at registration into the allocate response.
> Otherwise, container allocation responses on the same node will have an
> empty NMToken.
> issue credits: [~Nallasivan]

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763297#comment-17763297 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 merged PR #5975:
URL: https://github.com/apache/hadoop/pull/5975
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763081#comment-17763081 ]

ASF GitHub Bot commented on YARN-8980:
--
hadoop-yetus commented on PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1711615691

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:--------|
| +0 :ok: | reexec | 1m 36s | | Docker mode activated. |
|| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| _ trunk Compile Tests _ ||
| +0 :ok: | mvndep | 14m 13s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 39m 27s | | trunk passed |
| +1 :green_heart: | compile | 2m 39s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 2m 19s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 1m 23s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 42s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 36s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 25s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 3m 25s | | trunk passed |
| +1 :green_heart: | shadedclient | 38m 57s | | branch has no errors when building and testing our client artifacts. |
|| _ Patch Compile Tests _ ||
| +0 :ok: | mvndep | 0m 30s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 21s | | the patch passed |
| +1 :green_heart: | compile | 2m 32s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 2m 32s | | the patch passed |
| +1 :green_heart: | compile | 2m 14s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 2m 14s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 15s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 21s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 14s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 3m 34s | | the patch passed |
| +1 :green_heart: | shadedclient | 38m 49s | | patch has no errors when building and testing our client artifacts. |
|| _ Other Tests _ ||
| +1 :green_heart: | unit | 118m 46s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | unit | 24m 1s | | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. |
| | | 310m 32s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5975 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c9f4fe6406da 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / ed529f049b6f82bdf7876b5f8f923430c8551f68 |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/6/testReport/ |
| Max. process+thread count | 900 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760904#comment-17760904 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 commented on code in PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#discussion_r1311548384

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingUnmanagedAM.java:
## @@ -142,14 +156,124 @@ protected void testUAMRestart(boolean keepContainers) throws Exception {
     numContainers = 1;
     am.allocate("127.0.0.1", 1000, numContainers, new ArrayList<ContainerId>());
     nm.nodeHeartbeat(true);
-    conts = am.allocate(new ArrayList<ResourceRequest>(),
-        new ArrayList<ContainerId>()).getAllocatedContainers();
+    allocateResponse = am.allocate(new ArrayList<ResourceRequest>(), new ArrayList<ContainerId>());
+    allocateResponse.getNMTokens().forEach(token -> tokenCacheClientSide.add(token.getNodeId()));
+    conts = allocateResponse.getAllocatedContainers();
     while (conts.size() < numContainers) {
       nm.nodeHeartbeat(true);
-      conts.addAll(am.allocate(new ArrayList<ResourceRequest>(),
-          new ArrayList<ContainerId>()).getAllocatedContainers());
+      allocateResponse =
+          am.allocate(new ArrayList<ResourceRequest>(), new ArrayList<ContainerId>());
+      allocateResponse.getNMTokens().forEach(token -> tokenCacheClientSide.add(token.getNodeId()));
+      conts.addAll(allocateResponse.getAllocatedContainers());
       Thread.sleep(100);
     }
+    checkNMTokenForContainer(tokenCacheClientSide, conts);
+
+    rm.stop();
+  }
+
+  protected void testUAMRestartWithoutTransferContainer(boolean keepContainers) throws Exception {
+    // start RM
+    MockRM rm = new MockRM();
+    rm.start();
+    MockNM nm =
+        new MockNM("127.0.0.1:1234", 15120, rm.getResourceTrackerService());
+    nm.registerNode();
+    Set<NodeId> tokenCacheClientSide = new HashSet<>();
+
+    // create app and launch the UAM
+    boolean unmanaged = true;
+    int maxAttempts = 1;
+    boolean waitForAccepted = true;
+    MockRMAppSubmissionData data =
+        MockRMAppSubmissionData.Builder.createWithMemory(200, rm)
+            .withAppName("")
+            .withUser(UserGroupInformation.getCurrentUser().getShortUserName())
+            .withAcls(null)
+            .withUnmanagedAM(unmanaged)
+            .withQueue(null)
+            .withMaxAppAttempts(maxAttempts)
+            .withCredentials(null)
+            .withAppType(null)
+            .withWaitForAppAcceptedState(waitForAccepted)
+            .withKeepContainers(keepContainers)
+            .build();
+    RMApp app = MockRMAppSubmitter.submit(rm, data);
+
+    MockAM am = MockRM.launchUAM(app, rm, nm);
+
+    // Register for the first time
+    am.registerAppAttempt();
+
+    // Allocate three containers to the UAM
+    int numContainers = 3;
+    AllocateResponse allocateResponse =
+        am.allocate("127.0.0.1", 1000, numContainers, new ArrayList<ContainerId>());
+    allocateResponse.getNMTokens().forEach(token -> tokenCacheClientSide.add(token.getNodeId()));
+    List<Container> conts = allocateResponse.getAllocatedContainers();
+    while (conts.size() < numContainers) {
+      nm.nodeHeartbeat(true);
+      allocateResponse =
+          am.allocate(new ArrayList<ResourceRequest>(), new ArrayList<ContainerId>());
+      allocateResponse.getNMTokens().forEach(token -> tokenCacheClientSide.add(token.getNodeId()));
+      conts.addAll(allocateResponse.getAllocatedContainers());
+      Thread.sleep(100);
+    }
+    checkNMTokenForContainer(tokenCacheClientSide, conts);
+
+    // Release all containers, so there are no transferred containers for the next app attempt
+    List<ContainerId> releaseList = new ArrayList<>();
+    releaseList.add(conts.get(0).getId());
+    releaseList.add(conts.get(1).getId());
+    releaseList.add(conts.get(2).getId());
+    List<ContainerStatus> finishedConts =
+        am.allocate(new ArrayList<ResourceRequest>(), releaseList)
+            .getCompletedContainersStatuses();
+    while (finishedConts.size() < releaseList.size()) {
+      nm.nodeHeartbeat(true);
+      finishedConts
+          .addAll(am
+              .allocate(new ArrayList<ResourceRequest>(),
+                  new ArrayList<ContainerId>())
+              .getCompletedContainersStatuses());
+      Thread.sleep(100);
+    }
+
+    // Register for the second time
+    RegisterApplicationMasterResponse response = null;
+    try {
+      response = am.registerAppAttempt(false);
+      // When the AM restarts, the NMTokens cached on the client side are lost
+      tokenCacheClientSide.clear();
+      response.getNMTokensFromPreviousAttempts()
+          .forEach(token -> tokenCacheClientSide.add(token.getNodeId()));
+    } catch (InvalidApplicationMasterRequestException e) {
+      Assert.assertEquals(false, keepContainers);
+      return;
+    }
+    Assert.assertEquals("RM should not allow second register"
+        + " for UAM without keep container flag ", true, keepContainers);
+
+    // Expecting zero running containers
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760902#comment-17760902 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 commented on code in PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#discussion_r1311548384

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
## @@ -1434,6 +1449,17 @@ private void mergeAllocateResponses(AllocateResponse mergedResponse) {
         }
       }
     }
+    // When re-registering with the RM, the client may not cache the NMTokens from the
+    // register response. Here we pass these NMTokens along in the allocate stage.
+    if (nmTokenMapFromRegisterSecondaryCluster.size() > 0) {
+      List<NMToken> duplicateNmToken = new ArrayList<>(nmTokenMapFromRegisterSecondaryCluster);

Review Comment:
Why do we need to remove the token data from `nmTokenMapFromRegisterSecondaryCluster`?
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760900#comment-17760900 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 commented on code in PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#discussion_r1311543472

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
## @@ -1434,6 +1449,17 @@ private void mergeAllocateResponses(AllocateResponse mergedResponse) {
         }
       }
     }
+    // When re-registering with the RM, the client may not cache the NMTokens from the
+    // register response. Here we pass these NMTokens along in the allocate stage.
+    if (nmTokenMapFromRegisterSecondaryCluster.size() > 0) {
+      List<NMToken> duplicateNmToken = new ArrayList<>(nmTokenMapFromRegisterSecondaryCluster);

Review Comment:
If `nmTokenMapFromRegisterSecondaryCluster` is already a set, why is deduplication necessary?
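The behaviour under review here, caching NMTokens received at (re-)registration and merging them into a later allocate response exactly once, can be sketched outside Hadoop as follows. `NMToken`, `AllocateResponse`, and the method names are simplified stand-ins for the real YARN records, not the actual Hadoop classes; the drain-after-merge step is what the review comment asks about.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NMTokenMergeSketch {
  // Simplified stand-in for org.apache.hadoop.yarn.api.records.NMToken.
  static class NMToken {
    final String nodeId;
    NMToken(String nodeId) { this.nodeId = nodeId; }
    @Override public boolean equals(Object o) {
      return o instanceof NMToken && ((NMToken) o).nodeId.equals(nodeId);
    }
    @Override public int hashCode() { return nodeId.hashCode(); }
  }

  // Simplified stand-in for the merged AllocateResponse.
  static class AllocateResponse {
    final List<NMToken> nmTokens = new ArrayList<>();
  }

  // Tokens captured from the re-registration response; drained on the next allocate.
  private final Set<NMToken> nmTokenMapFromRegisterSecondaryCluster = new HashSet<>();

  void onReRegister(List<NMToken> tokensFromPreviousAttempts) {
    nmTokenMapFromRegisterSecondaryCluster.addAll(tokensFromPreviousAttempts);
  }

  // Merge the cached register-time tokens into the allocate response exactly once,
  // skipping node ids the response already carries, then drain the cache so each
  // token is only ever forwarded to the AM a single time.
  void mergeAllocateResponse(AllocateResponse merged) {
    if (nmTokenMapFromRegisterSecondaryCluster.isEmpty()) {
      return;
    }
    Set<String> alreadyPresent = new HashSet<>();
    for (NMToken t : merged.nmTokens) {
      alreadyPresent.add(t.nodeId);
    }
    for (NMToken t : new ArrayList<>(nmTokenMapFromRegisterSecondaryCluster)) {
      if (!alreadyPresent.contains(t.nodeId)) {
        merged.nmTokens.add(t);
      }
      nmTokenMapFromRegisterSecondaryCluster.remove(t); // drain: pass each token once
    }
  }

  public static void main(String[] args) {
    NMTokenMergeSketch fi = new NMTokenMergeSketch();
    fi.onReRegister(List.of(new NMToken("node1:1234"), new NMToken("node2:1234")));

    AllocateResponse resp = new AllocateResponse();
    resp.nmTokens.add(new NMToken("node1:1234")); // the RM re-sent this one itself
    fi.mergeAllocateResponse(resp);
    System.out.println(resp.nmTokens.size());     // node1 once, plus node2 from the cache

    fi.mergeAllocateResponse(new AllocateResponse()); // cache already drained
    System.out.println(fi.nmTokenMapFromRegisterSecondaryCluster.size());
  }
}
```

Draining the cache after the first merge answers why the token data is removed: without it, every subsequent allocate response would carry the same register-time tokens again.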
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760897#comment-17760897 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 commented on code in PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#discussion_r1311542321

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java:
## @@ -260,6 +261,16 @@ public class FederationInterceptor extends AbstractRequestInterceptor {
   private final MonotonicClock clock = new MonotonicClock();

+  /*
+   * For a UAM, keepContainersAcrossApplicationAttempts is always true.
+   * On re-registration with the RM, the RM clears its node set and regenerates NMTokens
+   * for the transferred containers. But if keepContainersAcrossApplicationAttempts of the
+   * AM is false, the AM may never call getNMTokensFromPreviousAttempts, so the NMTokens
+   * passed in the RegisterApplicationMasterResponse would be lost. Here we cache these
+   * NMTokens and pass them to the AM in the allocate stage.
+   */
+  private Set<NMToken> nmTokenMapFromRegisterSecondaryCluster;

Review Comment:
Using a Set is feasible, but should we consider using Map<SubClusterId, List<NMToken>> for better differentiation of the NMToken lists for each subcluster?
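The failure mode this cached field guards against can be illustrated with a minimal client-side token cache. The class and method names below are hypothetical stand-ins; only the error message mirrors the real `ContainerManagementProtocolProxy` behaviour quoted elsewhere in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the client-side NMToken cache behaviour that produces the
// "No NMToken sent for <host:port>" failure. NMTokenCacheSketch, newProxy and
// onAllocateResponse are illustrative names, not the real Hadoop client API.
public class NMTokenCacheSketch {
  private final Map<String, String> nmTokenByNode = new HashMap<>();

  // Tokens normally arrive piggy-backed on allocate responses.
  void onAllocateResponse(Map<String, String> nmTokens) {
    nmTokenByNode.putAll(nmTokens);
  }

  // Starting a container needs the node's NMToken; a node whose token the RM
  // never re-sent (the bug scenario after a UAM restart) fails here.
  String newProxy(String nodeAddr) {
    String token = nmTokenByNode.get(nodeAddr);
    if (token == null) {
      throw new IllegalStateException("No NMToken sent for " + nodeAddr);
    }
    return "proxy(" + nodeAddr + ")";
  }

  public static void main(String[] args) {
    NMTokenCacheSketch cache = new NMTokenCacheSketch();
    cache.onAllocateResponse(Map.of("node1:1234", "tok1"));
    System.out.println(cache.newProxy("node1:1234"));
    try {
      // The RM believes it already issued this token to the attempt, so it is
      // never resent after restart and the launch fails.
      cache.newProxy("node2:1234");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```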
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760891#comment-17760891 ]

ASF GitHub Bot commented on YARN-8980:
--
slfan1989 commented on PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1700908318

@zhengchenyu Thanks for your contribution! LGTM.
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759052#comment-17759052 ]

ASF GitHub Bot commented on YARN-8980:
--
hadoop-yetus commented on PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1693371460

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:--------|
| +0 :ok: | reexec | 0m 53s | | Docker mode activated. |
|| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| _ trunk Compile Tests _ ||
| +0 :ok: | mvndep | 14m 11s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 36m 6s | | trunk passed |
| +1 :green_heart: | compile | 2m 46s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 2m 30s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 1m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 44s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 42s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 32s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 3m 43s | | trunk passed |
| +1 :green_heart: | shadedclient | 39m 33s | | branch has no errors when building and testing our client artifacts. |
|| _ Patch Compile Tests _ ||
| +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 33s | | the patch passed |
| +1 :green_heart: | compile | 2m 39s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 2m 39s | | the patch passed |
| +1 :green_heart: | compile | 2m 26s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 2m 26s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 19s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 41s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 20s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 4m 7s | | the patch passed |
| +1 :green_heart: | shadedclient | 41m 35s | | patch has no errors when building and testing our client artifacts. |
|| _ Other Tests _ ||
| +1 :green_heart: | unit | 109m 32s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | unit | 25m 13s | | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. |
| | | 303m 47s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5975 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 8e8419904a3c 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / dcefe058c926eda05c294168f4b62d5d3e28d373 |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/3/testReport/ |
| Max. process+thread count | 908 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759038#comment-17759038 ]

ASF GitHub Bot commented on YARN-8980:
--
hadoop-yetus commented on PR #5975:
URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1693327592

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:--------|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
|| _ Prechecks _ ||
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| _ trunk Compile Tests _ ||
| +0 :ok: | mvndep | 14m 8s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 30m 16s | | trunk passed |
| +1 :green_heart: | compile | 2m 25s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 2m 13s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 1m 20s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 44s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 32s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 3m 18s | | trunk passed |
| +1 :green_heart: | shadedclient | 32m 49s | | branch has no errors when building and testing our client artifacts. |
|| _ Patch Compile Tests _ ||
| +0 :ok: | mvndep | 0m 30s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 20s | | the patch passed |
| +1 :green_heart: | compile | 2m 15s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 2m 15s | | the patch passed |
| +1 :green_heart: | compile | 2m 6s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 2m 6s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 11s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 19s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 17s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 3m 19s | | the patch passed |
| +1 :green_heart: | shadedclient | 33m 29s | | patch has no errors when building and testing our client artifacts. |
|| _ Other Tests _ ||
| +1 :green_heart: | unit | 100m 43s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | unit | 24m 23s | | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. |
| | | 269m 53s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5975 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux d9d17e4b84ba 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / dcefe058c926eda05c294168f4b62d5d3e28d373 |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/4/testReport/ |
| Max. process+thread count | 937 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758890#comment-17758890 ]

ASF GitHub Bot commented on YARN-8980:
--
zhengchenyu opened a new pull request, #5975:
URL: https://github.com/apache/hadoop/pull/5975

### Description of PR

To avoid repeatedly passing NMTokens to an application, the ResourceManager introduces NMTokenSecretManagerInRM, whose appAttemptToNodeKeyMap records which nodes have already been issued a token, keyed by AppAttempt. For a UAM there is only one AppAttempt, so after the UAM restarts, the previously issued NMTokens are lost on the client side. However, since NMTokenSecretManagerInRM::appAttemptToNodeKeyMap is not cleared, the ResourceManager will not resend the already-issued NMTokens, and container launch fails with a missing NMToken. The specific error is as follows:

```
No NMToken sent for XX_HOST:XX_PORT
	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:262)
	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:252)
	at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:137)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:433)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:146)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:394)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

For now, when the UAM restarts and re-registers, appAttemptToNodeKeyMap is cleared only when there are transferredContainers.

### How was this patch tested?

Unit test and test in a real cluster.

### For code changes:

Just move the clear code forward.

| getKeepContainersAcrossApplicationAttempts | getUnmanagedAM | effect |
| -
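The effect of moving the clear forward can be sketched with a toy version of the per-attempt node-key map. `NMTokenTracker` and its methods are illustrative stand-ins for NMTokenSecretManagerInRM, under the assumption that the RM resends a node's NMToken only when the node is not yet in that map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReRegisterSketch {
  // Toy stand-in for NMTokenSecretManagerInRM's appAttemptToNodeKeyMap bookkeeping.
  static class NMTokenTracker {
    private final Map<String, Set<String>> appAttemptToNodeKeyMap = new HashMap<>();

    // Returns true the first time a node is seen for an attempt, i.e. when a
    // fresh NMToken must be sent with the allocation.
    boolean needsNewToken(String attemptId, String node) {
      return appAttemptToNodeKeyMap
          .computeIfAbsent(attemptId, k -> new HashSet<>())
          .add(node);
    }

    void clearNodeSetForAttempt(String attemptId) {
      appAttemptToNodeKeyMap.remove(attemptId);
    }
  }

  public static void main(String[] args) {
    NMTokenTracker tracker = new NMTokenTracker();
    System.out.println(tracker.needsNewToken("attempt_1", "node1")); // first allocate: token sent
    System.out.println(tracker.needsNewToken("attempt_1", "node1")); // recorded: not re-sent

    // The UAM restarts and re-registers with no live (transferred) containers.
    // Before the fix, the clear below was skipped in exactly this case, so the
    // next allocation on node1 shipped without an NMToken; moving the clear
    // forward makes the token get re-sent.
    tracker.clearNodeSetForAttempt("attempt_1");
    System.out.println(tracker.needsNewToken("attempt_1", "node1"));
  }
}
```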
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758845#comment-17758845 ]

ASF GitHub Bot commented on YARN-8980:
--
zhengchenyu closed pull request #5975: YARN-8980. Mapreduce application container start fail after AM restart.
URL: https://github.com/apache/hadoop/pull/5975
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758796#comment-17758796 ] ASF GitHub Bot commented on YARN-8980: -- zhengchenyu commented on PR #5975: URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1692675962

> ```java
> List<Container> transferredContainers =
>     getScheduler().getTransferredContainers(applicationAttemptId);
> if (!transferredContainers.isEmpty()) {
>   response.setContainersFromPreviousAttempts(transferredContainers);
>   rmContext.getNMTokenSecretManager().clearNodeSetForAttempt(applicationAttemptId);
> }
> ```
>
> **In this code, the main operations are retrieving transferred containers, updating the response, and clearing the node set. All of these operations are O(1), and they are not nested or iterated over a collection. Therefore, the overall time complexity of this code is O(1).**

@whoami-anoint Do you mean the complexity of this code is no longer O(1) after this PR? I think it is still O(1) after this PR.
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757873#comment-17757873 ] ASF GitHub Bot commented on YARN-8980: -- zhengchenyu commented on PR #5975: URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1689488659 @goiri @slfan1989 Can you please review this PR? There is another issue to be discussed here. Currently, when submitting a UAM for a federated application, keepContainersAcrossApplicationAttempts is always true. Now that YARN-8898 is resolved, do we need to pass this value according to the original applicationSubmissionContext? In my opinion, this value should stay fixed at 'true', so I did not change it. What do you think?
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757813#comment-17757813 ] ASF GitHub Bot commented on YARN-8980: -- hadoop-yetus commented on PR #5975: URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1689305261

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 30s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 33m 12s | | trunk passed |
| +1 :green_heart: | compile | 0m 45s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 0m 39s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 0m 40s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 44s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 45s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 37s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 19s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 29s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 33s | | the patch passed |
| +1 :green_heart: | compile | 0m 35s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 35s | | the patch passed |
| +1 :green_heart: | compile | 0m 32s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 0m 32s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 28s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 33s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 30s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 16s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 40s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 86m 2s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 28s | | The patch does not generate ASF License warnings. |
| | | 175m 29s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5975 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 673b9bcf80d7 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 9e8388737e164b959fb345c927ed7933d36434ce |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/2/testReport/ |
| Max. process+thread count | 950 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757458#comment-17757458 ] ASF GitHub Bot commented on YARN-8980: -- hadoop-yetus commented on PR #5975: URL: https://github.com/apache/hadoop/pull/5975#issuecomment-1688157278

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 38s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 44m 5s | | trunk passed |
| +1 :green_heart: | compile | 1m 6s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 0m 57s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 5s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 53s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 2m 3s | | trunk passed |
| +1 :green_heart: | shadedclient | 35m 2s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 52s | | the patch passed |
| +1 :green_heart: | compile | 0m 57s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 57s | | the patch passed |
| +1 :green_heart: | compile | 0m 48s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 0m 48s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/1/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 :green_heart: | checkstyle | 0m 43s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 53s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 41s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 57s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 51s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 101m 8s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 41s | | The patch does not generate ASF License warnings. |
| | | 234m 37s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5975 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux e3188c65ed23 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / f6c6033c40f2539acb73648415b53651a7a339b3 |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5975/1/testReport/ |
| Max. process+thread count | 948 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output |
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757299#comment-17757299 ] ASF GitHub Bot commented on YARN-8980: -- zhengchenyu opened a new pull request, #5975: URL: https://github.com/apache/hadoop/pull/5975

### Description of PR
In order to avoid repeatedly passing an NMToken to an Application, the ResourceManager introduces NMTokenSecretManagerInRM, in which appAttemptToNodeKeyMap records which nodes have already been issued a token, keyed by AppAttempt. For a UAM, there is only one AppAttempt. Therefore, after the UAM restarts, the previous NMTokens are lost on the AM side. However, since NMTokenSecretManagerInRM::appAttemptToNodeKeyMap is not cleared, the ResourceManager will not resend the already-issued NMTokens, so container launch fails with a missing NMToken. The specific error is as follows:

```
No NMToken sent for XX_HOST:XX_PORT
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:262)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:252)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:137)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:433)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:146)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:394)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

### How was this patch tested?
unit test and test in real cluster.
### For code changes:
For now, when the UAM re-registers, appAttemptToNodeKeyMap is cleared only when there are transferredContainers. Just move the clear code forward.
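The mechanism described in this PR can be sketched with a minimal, self-contained model (hypothetical class and method names, not the actual Hadoop API; the real class is NMTokenSecretManagerInRM): the RM keeps a per-attempt set of nodes that have already received an NMToken, and a UAM has only one attempt, so unless that set is cleared when the UAM re-registers, the restarted AM never receives the tokens again:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of the RM-side per-attempt NMToken node cache.
// Hypothetical names; the real map is NMTokenSecretManagerInRM#appAttemptToNodeKeyMap.
public class NMTokenCacheModel {
  private final Map<String, Set<String>> attemptToNodes = new HashMap<>();

  // Called per allocation: returns true if an NMToken is sent for this node,
  // false if the RM believes the AM already holds one and skips it.
  public boolean sendTokenIfNeeded(String attemptId, String node) {
    return attemptToNodes
        .computeIfAbsent(attemptId, k -> new HashSet<>())
        .add(node);
  }

  // The fix moves this clear so it runs unconditionally on UAM re-registration,
  // not only inside the !transferredContainers.isEmpty() branch.
  public void clearNodeSetForAttempt(String attemptId) {
    attemptToNodes.remove(attemptId);
  }
}
```

In this model, if `clearNodeSetForAttempt` is skipped after the restart (as happened when there were no transferred containers), `sendTokenIfNeeded` keeps returning false for previously used nodes, which corresponds to the "No NMToken sent" launch failure above.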
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724130#comment-17724130 ] walhl.liu commented on YARN-8980: - I wonder if UAM supporting multi-attempt could solve this problem?
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579956#comment-17579956 ] fanshilun commented on YARN-8980: - I will continue to follow up on this PR.
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682962#comment-16682962 ] Botong Huang commented on YARN-8980: I agree. I am also worried about container leaks, since the new attempt of the (old) AM is not even aware of the existing containers from the UAMs. Note that the RM only supports one attempt for a UAM, and this UAM attempt is used throughout all AM attempts in the home SC. I think on top of option 1 you mentioned (clear the token cache in the RM), _FederationInterceptor_ needs to know the _keepContainer_ flag of the original AM. If it is false, after reattaching to the UAMs in _registerApplicationMaster_ it needs to release all running containers from the UAMs.
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682770#comment-16682770 ] Bibin A Chundatt commented on YARN-8980: [~botong]/[~subru] The issue is not completely related to the YARN-8898 discussion, but one of the solutions depends on it (Solution 2). AMRMProxy HA works by registering the UAM with the same application attempt ID. ApplicationMasterService#registerApplicationMaster:
{code:java}
if (!(appContext.getUnmanagedAM()
    && appContext.getKeepContainersAcrossApplicationAttempts())) {
{code}
Solutions
# DefaultAMSProcessor#registerApplicationMaster: clear the NM secret manager after the previous attempt's containers are set. This will make sure allocated containers get NMTokens again for the same hostname.
{code:java}
ApplicationSubmissionContext applicationSubmissionContext =
    app.getApplicationSubmissionContext();
if (applicationSubmissionContext.getUnmanagedAM()
    && applicationSubmissionContext
        .getKeepContainersAcrossApplicationAttempts()) {
  rmContext.getNMTokenSecretManager()
      .clearNodeSetForAttempt(applicationAttemptId);
}
response.setSchedulerResourceTypes(
    getScheduler().getSchedulingResourceTypes());
{code}
# Handle it in FederationInterceptor: add the tokens received on recovery to the first allocate response.
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682485#comment-16682485 ] Botong Huang commented on YARN-8980: Thanks [~bibinchundatt] for reporting. This is along the lines of the discussion we are having in YARN-8898. Basically, it is better to use the original _ApplicationSubmissionContext_ of the app when launching the UAMs. We will probably need to go with Solution 2 discussed there: push the applicationSubmissionContext to the federation store at the router side as well. [~subru] what do you think?
[jira] [Commented] (YARN-8980) Mapreduce application container start fail after AM restart.
[ https://issues.apache.org/jira/browse/YARN-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682255#comment-16682255 ] Bibin A Chundatt commented on YARN-8980: cc: [~botong] For the Mapreduce application, the initial containers assigned after restart come without an NMToken; container launches fail with an invalid-token error for containers assigned from secondary subclusters.