[jira] [Created] (MAPREDUCE-6855) Specify charset when create String in CredentialsTestJob
Akira Ajisaka created MAPREDUCE-6855: Summary: Specify charset when create String in CredentialsTestJob Key: MAPREDUCE-6855 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6855 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Akira Ajisaka Priority: Minor {code} String secretValueStr = new String (secretValue); {code} should be {code} String secretValueStr = new String(secretValue, StandardCharsets.UTF_8); {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
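The proposed change can be seen in a small, self-contained sketch (the variable name follows the snippet in the report; the class name and surrounding code are illustrative only): `new String(bytes)` decodes with the JVM's default charset, which varies across platforms, while naming the charset makes the result deterministic.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        byte[] secretValue = "secret".getBytes(StandardCharsets.UTF_8);

        // new String(bytes) decodes with the JVM's default charset, which
        // can differ between platforms; naming the charset makes it stable.
        String platformDependent = new String(secretValue);
        String stable = new String(secretValue, StandardCharsets.UTF_8);

        System.out.println(stable); // prints: secret
    }
}
```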
[jira] [Updated] (MAPREDUCE-6753) Variable in byte printed directly in mapreduce client
[ https://issues.apache.org/jira/browse/MAPREDUCE-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira Ajisaka updated MAPREDUCE-6753: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0-alpha3 2.8.1 2.9.0 Status: Resolved (was: Patch Available) Committed this to trunk, branch-2, and branch-2.8. Thanks to all who contributed to this issue. > Variable in byte printed directly in mapreduce client > - > > Key: MAPREDUCE-6753 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6753 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: client >Affects Versions: 2.7.2 >Reporter: Nemo Chen >Assignee: Kai Sasaki > Labels: easyfix, easytest > Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3 > > Attachments: MAPREDUCE-6753.01.patch, MAPREDUCE-6753.02.patch, > MAPREDUCE-6753.03.patch > > > Similar to the fix for HBASE-623, in file: > hadoop-rel-release-2.7.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/CredentialsTestJob.java > at line 61, System.out prints the byte array variable secretValue directly. > {code} > System.out.println(secretValue); > {code}
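For context on why the quoted line was a bug, here is a minimal, self-contained illustration (the variable name follows the report; the class name and sample payload are illustrative only): printing a byte[] directly goes through Object#toString and yields a type tag plus identity hash, not the contents.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrintBytesDemo {
    public static void main(String[] args) {
        byte[] secretValue = "alias1=value1".getBytes(StandardCharsets.UTF_8);

        // Printing the array reference goes through Object#toString and
        // produces something like "[B@1b6d3586" -- a type tag plus identity
        // hash, not the array's contents.
        System.out.println(secretValue);

        // Either decode the bytes with an explicit charset...
        System.out.println(new String(secretValue, StandardCharsets.UTF_8));
        // ...or format the raw array element by element.
        System.out.println(Arrays.toString(secretValue));
    }
}
```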
[jira] [Commented] (MAPREDUCE-6753) Variable in byte printed directly in mapreduce client
[ https://issues.apache.org/jira/browse/MAPREDUCE-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893835#comment-15893835 ] Akira Ajisaka commented on MAPREDUCE-6753: -- +1, thanks [~lewuathe] and [~haibochen]. > Variable in byte printed directly in mapreduce client > - > > Key: MAPREDUCE-6753 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6753 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: client >Affects Versions: 2.7.2 >Reporter: Nemo Chen >Assignee: Kai Sasaki > Labels: easyfix, easytest > Attachments: MAPREDUCE-6753.01.patch, MAPREDUCE-6753.02.patch, > MAPREDUCE-6753.03.patch > > > Similar to the fix for HBASE-623, in file: > hadoop-rel-release-2.7.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/CredentialsTestJob.java > in line 61, the system out print a byte variable secretValue. > {code} > System.out.println(secretValue); > {code}
[jira] [Updated] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition
[ https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-6852: -- Fix Version/s: 3.0.0-alpha3 > Job#updateStatus() failed with NPE due to race condition > > > Key: MAPREDUCE-6852 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.9.0, 3.0.0-alpha3 > > Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch > > > Like MAPREDUCE-6762, we found this issue in a cluster where a Pig query > occasionally failed with an NPE - "Pig uses the JobControl API to track MR job status, > but sometimes the Job History Server failed to flush job meta files to HDFS, which > caused the status update to fail." Besides the NPE in > o.a.h.mapreduce.Job.getJobName, we also get an NPE in Job.updateStatus(); the > exception is as follows: > {noformat} > Caused by: java.lang.NullPointerException > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833) > at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320) > at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604) > {noformat} > We found the state here is null. However, we already check that the job state is > RUNNING, as in the code below: > {noformat} > public boolean isComplete() throws IOException { > ensureState(JobState.RUNNING); > updateStatus(); > return status.isJobComplete(); > } > {noformat} > The only possible explanation is that two threads are calling here at the same > time: both pass the state check first, then one thread updates the state to null while the > other thread hits the NPE. > We should fix this NPE. 
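The check-then-act gap described in the report can be sketched with a deliberately simplified, hypothetical model (this is not Hadoop's actual Job class; names and the String-based status are illustrative only): reading the shared field once into a local closes the window in which another thread can null it between the check and the use.

```java
public class StatusHolder {
    // Shared mutable field, analogous to the Job's cached status: another
    // thread may set it to null at any time.
    private volatile String status = "RUNNING";

    // Unsafe check-then-act: the field is read twice, so it can become
    // null after the null check and before the equals() call.
    public boolean isCompleteRacy() {
        if (status != null) {
            return status.equals("SUCCEEDED"); // may throw NPE under a race
        }
        return false;
    }

    // Safe variant: copy the field into a local once, then act only on the
    // snapshot, which no other thread can change.
    public boolean isCompleteSafe() {
        String snapshot = status;
        return snapshot != null && snapshot.equals("SUCCEEDED");
    }

    public void clear() { status = null; }

    public static void main(String[] args) {
        StatusHolder holder = new StatusHolder();
        holder.clear(); // simulate the other thread nulling the status
        System.out.println(holder.isCompleteSafe()); // prints: false, no NPE
    }
}
```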
[jira] [Commented] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition
[ https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892827#comment-15892827 ] Junping Du commented on MAPREDUCE-6852: --- Thanks Jian for review and commit! > Job#updateStatus() failed with NPE due to race condition > > > Key: MAPREDUCE-6852 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.9.0, 3.0.0-alpha3 > > Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch > > > Like MAPREDUCE-6762, we found this issue in a cluster where Pig query > occasionally failed on NPE - "Pig uses JobControl API to track MR job status, > but sometimes Job History Server failed to flush job meta files to HDFS which > caused the status update failed." Beside NPE in > o.a.h.mapreduce.Job.getJobName, we also get NPE in Job.updateStatus() and the > exception is as following: > {noformat} > Caused by: java.lang.NullPointerException > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833) > at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320) > at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604) > {noformat} > We found state here is null. However, we already check the job state to be > RUNNING as code below: > {noformat} > public boolean isComplete() throws IOException { > ensureState(JobState.RUNNING); > updateStatus(); > return status.isJobComplete(); > } > {noformat} > The only possible reason here is two threads are calling here for the same > time: ensure state first, then one thread update the state to null while the > other thread hit NPE issue here. > We should fix this NPE exception. 
[jira] [Commented] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition
[ https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892808#comment-15892808 ] Hudson commented on MAPREDUCE-6852: --- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11331 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11331/]) MAPREDUCE-6852. Job#updateStatus() failed with NPE due to race (jianhe: rev 747bafaf969857b66233a8b4660590bdd712ed7d) * (edit) hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Job.java > Job#updateStatus() failed with NPE due to race condition > > > Key: MAPREDUCE-6852 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.9.0 > > Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch > > > Like MAPREDUCE-6762, we found this issue in a cluster where Pig query > occasionally failed on NPE - "Pig uses JobControl API to track MR job status, > but sometimes Job History Server failed to flush job meta files to HDFS which > caused the status update failed." Beside NPE in > o.a.h.mapreduce.Job.getJobName, we also get NPE in Job.updateStatus() and the > exception is as following: > {noformat} > Caused by: java.lang.NullPointerException > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833) > at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320) > at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604) > {noformat} > We found state here is null. 
However, we already check the job state to be > RUNNING as code below: > {noformat} > public boolean isComplete() throws IOException { > ensureState(JobState.RUNNING); > updateStatus(); > return status.isJobComplete(); > } > {noformat} > The only possible reason here is two threads are calling here for the same > time: ensure state first, then one thread update the state to null while the > other thread hit NPE issue here. > We should fix this NPE exception.
[jira] [Updated] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition
[ https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated MAPREDUCE-6852: --- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.9.0 Status: Resolved (was: Patch Available) Committed to trunk and branch-2, thanks Junping ! > Job#updateStatus() failed with NPE due to race condition > > > Key: MAPREDUCE-6852 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852 > Project: Hadoop Map/Reduce > Issue Type: Bug >Reporter: Junping Du >Assignee: Junping Du > Fix For: 2.9.0 > > Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch > > > Like MAPREDUCE-6762, we found this issue in a cluster where Pig query > occasionally failed on NPE - "Pig uses JobControl API to track MR job status, > but sometimes Job History Server failed to flush job meta files to HDFS which > caused the status update failed." Beside NPE in > o.a.h.mapreduce.Job.getJobName, we also get NPE in Job.updateStatus() and the > exception is as following: > {noformat} > Caused by: java.lang.NullPointerException > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323) > at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833) > at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320) > at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604) > {noformat} > We found state here is null. 
However, we already check the job state to be > RUNNING as code below: > {noformat} > public boolean isComplete() throws IOException { > ensureState(JobState.RUNNING); > updateStatus(); > return status.isJobComplete(); > } > {noformat} > The only possible reason here is two threads are calling here for the same > time: ensure state first, then one thread update the state to null while the > other thread hit NPE issue here. > We should fix this NPE exception.
[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892631#comment-15892631 ] Haibo Chen commented on MAPREDUCE-6834: --- bq. IMHO, we shouldn't do this, because it's not the fix for root cause. It looks like workaround Sorry for the confusion. I should have been more specific, I meant to say that the root cause analysis looks correct to me. I agree with you that we should not follow the proposed fix there. > MR application fails with "No NMToken sent" exception after MRAppMaster > recovery > > > Key: MAPREDUCE-6834 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: resourcemanager, yarn >Affects Versions: 2.7.0 > Environment: Centos 7 >Reporter: Aleksandr Balitsky >Assignee: Aleksandr Balitsky >Priority: Critical > Attachments: YARN-6019.001.patch > > > *Steps to reproduce:* > 1) Submit MR application (for example PI app with 50 containers) > 2) Find MRAppMaster process id for the application > 3) Kill MRAppMaster by kill -9 command > *Expected:* ResourceManager launch new MRAppMaster container and MRAppAttempt > and application finish correctly > *Actually:* After launching new MRAppMaster and MRAppAttempt the application > fails with the following exception: > {noformat} > 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container > launch failed for container_1482408247195_0002_02_11 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for node1:43037 > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244) > at > 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > *Problem*: > When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM > generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted > to RMCommunicator in the RegisterApplicationMasterResponse > (getNMTokensFromPreviousAttempts method). But we don't handle these tokens in > the RMCommunicator.register method. The RM doesn't transmit these tokens again for other > allocation requests, but we don't have these tokens in the NMTokenCache. > Accordingly, we get the "No NMToken sent for node" exception. > I have found that this issue appears after the changes from > https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed > > I tried the same scenario without the commit, and the application completed > successfully after MRAppMaster recovery
[jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892455#comment-15892455 ] Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:59 PM: --- [~jlowe], I'm not sure that YARN-3112 is the same issue that I reported. From the YARN-3112 description: {code} New AM has inherited the old tokens from previous AM according to my configuration (keepContainers=true), so the token for new containers are replaced by the old one in the NMTokenCache. {code} I have not used the "keep-containers-across-application-attempts" feature, and I definitely did not face the problem where the token for new containers is replaced by the old one in the NMTokenCache (my debugging can confirm it), because the AM was restarted and all old tokens were removed. was (Author: abalitsky1): [~jlowe], I'm not sure that YARN-3112 the same issue that I reported. >From YARN-3112 description: {code} New AM has inherited the old tokens from previous AM according to my configuration (keepContainers=true), so the token for new containers are replaced by the old one in the NMTokenCache. {code} I have not used "keep-containers-across-application-attempts" feature, and I definitely didn't faced with problem when the token for new containers are replaced by the old one in the NMTokenCache (my debug can confirm it), cause AM was restarted and all old tokens was removed. 
> MR application fails with "No NMToken sent" exception after MRAppMaster > recovery > > > Key: MAPREDUCE-6834 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: resourcemanager, yarn >Affects Versions: 2.7.0 > Environment: Centos 7 >Reporter: Aleksandr Balitsky >Assignee: Aleksandr Balitsky >Priority: Critical > Attachments: YARN-6019.001.patch > > > *Steps to reproduce:* > 1) Submit MR application (for example PI app with 50 containers) > 2) Find MRAppMaster process id for the application > 3) Kill MRAppMaster by kill -9 command > *Expected:* ResourceManager launch new MRAppMaster container and MRAppAttempt > and application finish correctly > *Actually:* After launching new MRAppMaster and MRAppAttempt the application > fails with the following exception: > {noformat} > 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container > launch failed for container_1482408247195_0002_02_11 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for node1:43037 > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > *Problem*: > When RMCommunicator sends "registerApplicationMaster" request to RM, RM > generates NMTokens for new RMAppAttempt. Those new NMTokens are transmitted > to RMCommunicator in RegisterApplicationMasterResponse > (getNMTokensFromPreviousAttempts method). But we don't handle these tokens in > RMCommunicator.register method. RM don't transmit tese tokens again for other > allocated requests, but we don't have these tokens in NMTokenCache. > Accordingly we get "No NMToken sent for node" exception. > I have found that this issue appears after changes from the > https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed > > I tried to do the same scenario without the commit and application completed > successfully after RMAppMaster recovery
[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892455#comment-15892455 ] Aleksandr Balitsky commented on MAPREDUCE-6834: --- [~jlowe], I'm not sure that YARN-3112 the same issue that I reported. >From YARN-3112 description: {code} New AM has inherited the old tokens from previous AM according to my configuration (keepContainers=true), so the token for new containers are replaced by the old one in the NMTokenCache. {code} I have not used "keep-containers-across-application-attempts" feature, and I definitely didn't faced with problem when the token for new containers are replaced by the old one in the NMTokenCache (my debug can confirm it), cause AM was restarted and all old tokens was removed. > MR application fails with "No NMToken sent" exception after MRAppMaster > recovery > > > Key: MAPREDUCE-6834 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: resourcemanager, yarn >Affects Versions: 2.7.0 > Environment: Centos 7 >Reporter: Aleksandr Balitsky >Assignee: Aleksandr Balitsky >Priority: Critical > Attachments: YARN-6019.001.patch > > > *Steps to reproduce:* > 1) Submit MR application (for example PI app with 50 containers) > 2) Find MRAppMaster process id for the application > 3) Kill MRAppMaster by kill -9 command > *Expected:* ResourceManager launch new MRAppMaster container and MRAppAttempt > and application finish correctly > *Actually:* After launching new MRAppMaster and MRAppAttempt the application > fails with the following exception: > {noformat} > 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container > launch failed for container_1482408247195_0002_02_11 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for node1:43037 > at > 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > *Problem*: > When RMCommunicator sends "registerApplicationMaster" request to RM, RM > generates NMTokens for new RMAppAttempt. Those new NMTokens are transmitted > to RMCommunicator in RegisterApplicationMasterResponse > (getNMTokensFromPreviousAttempts method). But we don't handle these tokens in > RMCommunicator.register method. RM don't transmit tese tokens again for other > allocated requests, but we don't have these tokens in NMTokenCache. > Accordingly we get "No NMToken sent for node" exception. 
> I have found that this issue appears after changes from the > https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed > > I tried to do the same scenario without the commit and application completed > successfully after RMAppMaster recovery
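The missing-token handling described in the *Problem* section can be sketched with a deliberately simplified, hypothetical model of the token cache (this is not Hadoop's real NMTokenCache or RMCommunicator API; all names are illustrative): tokens delivered once in the registration response must be stored on register, or later container launches fail exactly as in the report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the fix direction discussed above; it
// is NOT Hadoop's real API. On register, tokens minted for the previous
// attempt's nodes are stored so later container launches can find them.
public class NMTokenCacheModel {
    // node address -> token; stands in for the AM-side NMToken cache
    private final Map<String, String> cache = new HashMap<>();

    // Models handling the registration response: the RM sends these tokens
    // only once here and will not resend them on later allocate calls, so
    // failing to cache them loses them for good.
    public void register(Map<String, String> tokensFromPreviousAttempts) {
        cache.putAll(tokensFromPreviousAttempts);
    }

    public String tokenFor(String nodeAddr) {
        String token = cache.get(nodeAddr);
        if (token == null) {
            // Mirrors the "No NMToken sent for node" failure in the report.
            throw new IllegalStateException("No NMToken sent for " + nodeAddr);
        }
        return token;
    }

    public static void main(String[] args) {
        NMTokenCacheModel cache = new NMTokenCacheModel();
        Map<String, String> fromPrevAttempt = new HashMap<>();
        fromPrevAttempt.put("node1:43037", "token-A");
        cache.register(fromPrevAttempt);
        System.out.println(cache.tokenFor("node1:43037")); // prints: token-A
    }
}
```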
[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892442#comment-15892442 ] Aleksandr Balitsky commented on MAPREDUCE-6834: --- [~jlowe], yep, 001 patch isn't good from design point of view. I'm going to investigate the code again to find the root cause. > MR application fails with "No NMToken sent" exception after MRAppMaster > recovery > > > Key: MAPREDUCE-6834 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: resourcemanager, yarn >Affects Versions: 2.7.0 > Environment: Centos 7 >Reporter: Aleksandr Balitsky >Assignee: Aleksandr Balitsky >Priority: Critical > Attachments: YARN-6019.001.patch > > > *Steps to reproduce:* > 1) Submit MR application (for example PI app with 50 containers) > 2) Find MRAppMaster process id for the application > 3) Kill MRAppMaster by kill -9 command > *Expected:* ResourceManager launch new MRAppMaster container and MRAppAttempt > and application finish correctly > *Actually:* After launching new MRAppMaster and MRAppAttempt the application > fails with the following exception: > {noformat} > 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container > launch failed for container_1482408247195_0002_02_11 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for node1:43037 > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129) > at > 
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > *Problem*: > When RMCommunicator sends "registerApplicationMaster" request to RM, RM > generates NMTokens for new RMAppAttempt. Those new NMTokens are transmitted > to RMCommunicator in RegisterApplicationMasterResponse > (getNMTokensFromPreviousAttempts method). But we don't handle these tokens in > RMCommunicator.register method. RM don't transmit tese tokens again for other > allocated requests, but we don't have these tokens in NMTokenCache. > Accordingly we get "No NMToken sent for node" exception. > I have found that this issue appears after changes from the > https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed > > I tried to do the same scenario without the commit and application completed > successfully after RMAppMaster recovery
[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892416#comment-15892416 ] Jason Lowe commented on MAPREDUCE-6834: --- Ah, comment race! ;-) [~abalitsky1] so if I understand correctly, you're saying that the patch in this JIRA does _not_ fix the issue? I'm trying to resolve that with [this comment|https://issues.apache.org/jira/browse/MAPREDUCE-6834?focusedCommentId=15770392=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15770392]. If you agree that the patch here isn't appropriate, then I agree we should just duplicate this to YARN-3112. > MR application fails with "No NMToken sent" exception after MRAppMaster > recovery > > > Key: MAPREDUCE-6834 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: resourcemanager, yarn >Affects Versions: 2.7.0 > Environment: Centos 7 >Reporter: Aleksandr Balitsky >Assignee: Aleksandr Balitsky >Priority: Critical > Attachments: YARN-6019.001.patch > > > *Steps to reproduce:* > 1) Submit MR application (for example PI app with 50 containers) > 2) Find MRAppMaster process id for the application > 3) Kill MRAppMaster by kill -9 command > *Expected:* ResourceManager launch new MRAppMaster container and MRAppAttempt > and application finish correctly > *Actually:* After launching new MRAppMaster and MRAppAttempt the application > fails with the following exception: > {noformat} > 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container > launch failed for container_1482408247195_0002_02_11 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for node1:43037 > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254) > at > 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > *Problem*: > When RMCommunicator sends "registerApplicationMaster" request to RM, RM > generates NMTokens for new RMAppAttempt. Those new NMTokens are transmitted > to RMCommunicator in RegisterApplicationMasterResponse > (getNMTokensFromPreviousAttempts method). But we don't handle these tokens in > RMCommunicator.register method. RM don't transmit tese tokens again for other > allocated requests, but we don't have these tokens in NMTokenCache. > Accordingly we get "No NMToken sent for node" exception. > I have found that this issue appears after changes from the > https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed > > I tried to do the same scenario without the commit and application completed > successfully after RMAppMaster recovery
[jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892400#comment-15892400 ] Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:27 PM: --- Hi [~haibochen], [~jlowe] Sorry for the late reply. {quote} Is this a scenario where somehow the MRAppMaster is asking to preserve containers across app attempts? I ask because ApplicationMasterService normally does not call setNMTokensFromPreviousAttempts on RegisterApplicationMasterResponse unless getKeepContainersAcrossApplicationAttempts on the application submission context is true. Last I checked the MapReduce client (YARNRunner) wasn't specifying that when the application is submitted to YARN. {quote} Actually, you are right. I did not consider that MR doesn't support AM work-preserving restart, and I now see that my first patch isn't a good solution for this problem. Thanks for the review! {quote} Aleksandr Balitsky, which scheduler were you running? {quote} I'm running the Fair Scheduler. I don't think that this issue depends on the scheduler, but I will check it with other schedulers. {quote} We have not made changes to preserve containers in MR. Chasing the code in more detail, I came to a similar conclusion as https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003 MR relies on YARN RM to get the NMTokens needed to launch containers with NMs. Given the code today, it is possible that a null NMToken is sent to MR, which contradicts the javadoc in SchedulerApplicationAttempt.java here {quote} I totally agree with you that we have not made changes to preserve containers in MR. 
But the solution that you mentioned contradicts the YARN design: {quote} As for network optimization, NMTokens are not sent to the ApplicationMasters for each and every allocated container, but only for the first time or if NMTokens have to be invalidated due to the rollover of the underlying master key {quote} It is true that a null NMToken can be sent to MR: NMTokens are sent only on first creation, which is a designed feature, and the AM side then saves them to NMTokenCache. It's not necessary to pass NM tokens during each allocation interaction. So it's not the best decision to clear the NMTokenSecretManager cache during each allocation, because that disables the caching and new NM tokens will be generated (instead of using the cached instance) during each allocation response. IMHO, we shouldn't do this, because it doesn't fix the root cause. It looks like a workaround.
[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery
[ https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892407#comment-15892407 ] Jason Lowe commented on MAPREDUCE-6834: --- Based the patch which claims to fix the problem, I would argue it is not a duplicate. The patch is all about transferring containers from the previous attempt in the registration response, but that is not filled in unless the application was submitted with preservation of containers across application attempts. MapReduce does not do this, therefore I don't see how this patch helps the problem unless MapReduce was patched to do so. I agree the symptom is similar to YARN-3112, but I doubt a fix for YARN-3112 will address [~abalitsky1]'s issue if the original patch in this JIRA also corrected it. > MR application fails with "No NMToken sent" exception after MRAppMaster > recovery > > > Key: MAPREDUCE-6834 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: resourcemanager, yarn >Affects Versions: 2.7.0 > Environment: Centos 7 >Reporter: Aleksandr Balitsky >Assignee: Aleksandr Balitsky >Priority: Critical > Attachments: YARN-6019.001.patch > > > *Steps to reproduce:* > 1) Submit MR application (for example PI app with 50 containers) > 2) Find MRAppMaster process id for the application > 3) Kill MRAppMaster by kill -9 command > *Expected:* ResourceManager launch new MRAppMaster container and MRAppAttempt > and application finish correctly > *Actually:* After launching new MRAppMaster and MRAppAttempt the application > fails with the following exception: > {noformat} > 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container > launch failed for container_1482408247195_0002_02_11 : > org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent > for node1:43037 > at > 
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244) > at > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138) > at > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > *Problem*: > When RMCommunicator sends "registerApplicationMaster" request to RM, RM > generates NMTokens for new RMAppAttempt. Those new NMTokens are transmitted > to RMCommunicator in RegisterApplicationMasterResponse > (getNMTokensFromPreviousAttempts method). But we don't handle these tokens in > RMCommunicator.register method. RM don't transmit tese tokens again for other > allocated requests, but we don't have these tokens in NMTokenCache. > Accordingly we get "No NMToken sent for node" exception. 
> I have found that this issue appears after changes from the > https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed > > I tried to do the same scenario without the commit and application completed > successfully after RMAppMaster recovery -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
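The missing register-time handling that the Problem section describes can be illustrated with a minimal, self-contained sketch in plain Java. The class, method, and map below are hypothetical stand-ins for RMCommunicator.register, RegisterApplicationMasterResponse.getNMTokensFromPreviousAttempts, and NMTokenCache; they are not the actual YARN API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the idea: when registration returns NMTokens generated for the
// new app attempt, cache them on the AM side so later container launches can
// find a token for the node. All names here are hypothetical stand-ins.
public class RegisterTokenSketch {

    // Stand-in for NMTokenCache: node address -> opaque token.
    static final Map<String, String> NM_TOKEN_CACHE = new HashMap<>();

    // What a register() implementation could do with the tokens carried in
    // the registration response: cache each one by node address.
    static void cacheTokensFromPreviousAttempts(Map<String, String> tokensByNode) {
        NM_TOKEN_CACHE.putAll(tokensByNode);
    }

    public static void main(String[] args) {
        Map<String, String> fromRegistration = new LinkedHashMap<>();
        fromRegistration.put("node1:43037", "opaque-token-bytes");
        cacheTokensFromPreviousAttempts(fromRegistration);

        // Without the caching step above, this lookup fails and the launch
        // path raises "No NMToken sent for node1:43037".
        System.out.println(NM_TOKEN_CACHE.containsKey("node1:43037"));
    }
}
```

Since the RM deliberately does not resend these tokens on later allocate calls, losing them at registration time leaves the cache permanently empty for those nodes, which matches the reported failure mode.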
[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes an object name
[ https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Vernik updated MAPREDUCE-6854: -- Description: Consider an example: a local file "/data/a.txt" needs to be copied into swift://container.service/data/a.txt The way distcp works is that it will first upload "/data/a.txt" into swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion, distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such a flow is problematic in object stores, where it is usually advised not to create, delete, and re-create objects under the same name. This JIRA proposes to add a configuration key indicating that temporary objects will also include the object name as part of their temporary file name. For example, "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" was: Consider an example: a local file "/data/a.txt" need to be copied into swift://container.service/data/a.txt The way distcp works is that first it will upload "/data/a.txt" into swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such flow is problematic in the object stores, where it usually advised not to create, delete and create object under the same name. 
This JIRA propose to add a configuration key indicating that temporary objects will also include object name as part of their temporary file name, For example "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" > Each map task should create a unique temporary name that includes an object > name > > > Key: MAPREDUCE-6854 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: Gil Vernik > > Consider an example: a local file "/data/a.txt" need to be copied into > swift://container.service/data/a.txt > The way distcp works is that first it will upload "/data/a.txt" into > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > Upon completion distcp will move > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > into swift://container.mil01/data/a.txt > > The temporary file naming convention assumes that each map task will > sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID > and then rename them to the final names. Such flow is problematic in the > object stores, where it usually advised not to create, delete and create > object under the same name. > This JIRA propose to add a configuration key indicating that temporary > objects will also include object name as part of their temporary file name, > For example > "/data/a.txt" will be uploaded into > "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt" > or > "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
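The two temporary-name layouts proposed in the description can be sketched with plain string handling. The class and helper names below are hypothetical illustrations, not distcp's actual API.

```java
// Sketch of the two proposed temporary-name layouts for object stores.
// The helpers and their arguments are hypothetical, not distcp's real code.
public class TmpNameSketch {
    static final String TMP_PREFIX = ".distcp.tmp.";

    // Layout 1: <parent>/<tmp-prefix><attemptId>/<objectName>
    static String tmpUnderAttemptName(String parent, String attemptId, String objectName) {
        return parent + "/" + TMP_PREFIX + attemptId + "/" + objectName;
    }

    // Layout 2: <parent>/<objectName>/<tmp-prefix><attemptId>
    static String tmpUnderObjectName(String parent, String attemptId, String objectName) {
        return parent + "/" + objectName + "/" + TMP_PREFIX + attemptId;
    }

    public static void main(String[] args) {
        // Attempt id copied from the issue description's example.
        String attempt = "attempt_local2036034928_0001_m_00_0";
        System.out.println(tmpUnderAttemptName("swift://container.mil01/data", attempt, "a.txt"));
        System.out.println(tmpUnderObjectName("swift://container.mil01/data", attempt, "a.txt"));
    }
}
```

Either layout makes the temporary name unique per target object, so the object store never sees a create/delete/re-create cycle under the same key.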
[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes an object name
[ https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Vernik updated MAPREDUCE-6854: -- Summary: Each map task should create a unique temporary name that includes an object name (was: Each map task should create a unique temporary name that includes object name) > Each map task should create a unique temporary name that includes an object > name > > > Key: MAPREDUCE-6854 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: Gil Vernik > > Consider an example: a local file "/data/a.txt" need to be copied into > swift://container.service/data/a.txt > The way distcp works is that first it will upload "/data/a.txt" into > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > Upon completion distcp will move > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > into swift://container.mil01/data/a.txt > The temporary file naming convention assumes that each map task will > sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID > and then rename them to the final names. Such flow is problematic in the > object stores, where it usually advised not to create, delete and create > object under the same name. > This JIRA propose to add a configuration key indicating that temporary > objects will also include object name as part of their temporary file name, > For example > "/data/a.txt" will be uploaded into > "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt" > or > "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes object name
[ https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Vernik updated MAPREDUCE-6854: -- Description: Consider an example: a local file "/data/a.txt" need to be copied into swift://container.service/data/a.txt The way distcp works is that first it will upload "/data/a.txt" into swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such flow is problematic in the object stores, where it usually advised not to create, delete and create object under the same name. This JIRA propose to add a configuration key indicating that temporary objects will also include object name as part of their temporary file name, For example "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" was: Consider an example: a local file "/data/a.txt" need to be copied into swift://container.service/data/a.txt The way distcp works is that first it will upload "/data/a.txt" into swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such flow is problematic in the object stores, where it usually advised not to create, delete and create object under the same name. 
This JIRA propose to add a configuration key indicating that temporary objects will also include object name as part of their temporary file name, For example "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" > Each map task should create a unique temporary name that includes object name > - > > Key: MAPREDUCE-6854 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: Gil Vernik > > Consider an example: a local file "/data/a.txt" need to be copied into > swift://container.service/data/a.txt > The way distcp works is that first it will upload "/data/a.txt" into > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > Upon completion distcp will move > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > into swift://container.mil01/data/a.txt > The temporary file naming convention assumes that each map task will > sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID > and then rename them to the final names. Such flow is problematic in the > object stores, where it usually advised not to create, delete and create > object under the same name. > This JIRA propose to add a configuration key indicating that temporary > objects will also include object name as part of their temporary file name, > For example > "/data/a.txt" will be uploaded into > "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt" > or > "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes object name
[ https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Vernik updated MAPREDUCE-6854: -- Description: Consider an example: a local file "/data/a.txt" need to be copied into swift://container.service/data/a.txt The way distcp works is that first it will upload "/data/a.txt" into swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such flow is problematic in the object stores, where it usually advised not to create, delete and create object under the same name. This JIRA propose to add a configuration key indicating that temporary objects will also include object name as part of their temporary file name, For example "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" was: Consider an example: a local file "/data/a.txt" need to be copied into swift://container.service/data/a.txt The way distcp works is that first it will upload "/data/a.txt" into swift://container.mil01/data3/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such flow is problematic in the object stores, where it usually advised not to create, delete and create object under the same name. 
This JIRA propose to add a configuration key indicating that temporary objects will also include object name as part of their temporary file name, For example "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" > Each map task should create a unique temporary name that includes object name > - > > Key: MAPREDUCE-6854 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: distcp >Reporter: Gil Vernik > > Consider an example: a local file "/data/a.txt" need to be copied into > swift://container.service/data/a.txt > The way distcp works is that first it will upload "/data/a.txt" into > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > Upon completion distcp will move > swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 > into swift://container.mil01/data/a.txt > The temporary file naming convention assumes that each map task will > sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID > and then rename them to the final names. Such flow is problematic in the > object stores, where it usually advised not to create, delete and create > object under the same name. > This JIRA propose to add a configuration key indicating that temporary > objects will also include object name as part of their temporary file name, > For example > "/data/a.txt" will be uploaded into > "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt" > or > "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Created] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes object name
Gil Vernik created MAPREDUCE-6854: - Summary: Each map task should create a unique temporary name that includes object name Key: MAPREDUCE-6854 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854 Project: Hadoop Map/Reduce Issue Type: Improvement Components: distcp Reporter: Gil Vernik Consider an example: a local file "/data/a.txt" need to be copied into swift://container.service/data/a.txt The way distcp works is that first it will upload "/data/a.txt" into swift://container.mil01/data3/.distcp.tmp.attempt_local2036034928_0001_m_00_0 Upon completion distcp will move swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0 into swift://container.mil01/data/a.txt The temporary file naming convention assumes that each map task will sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID and then rename them to the final names. Such flow is problematic in the object stores, where it usually advised not to create, delete and create object under the same name. This JIRA propose to add a configuration key indicating that temporary objects will also include object name as part of their temporary file name, For example "/data/a.txt" will be uploaded into "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt" or "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0" -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org