[jira] [Created] (MAPREDUCE-6855) Specify charset when create String in CredentialsTestJob

2017-03-02 Thread Akira Ajisaka (JIRA)
Akira Ajisaka created MAPREDUCE-6855:


 Summary: Specify charset when create String in CredentialsTestJob
 Key: MAPREDUCE-6855
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6855
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Akira Ajisaka
Priority: Minor


{code}
  String secretValueStr = new String (secretValue);
{code}
should be
{code}
  String secretValueStr = new String(secretValue, StandardCharsets.UTF_8);
{code}
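
For reference, a minimal self-contained sketch of the proposed change (illustrative only: the class and variable names are assumptions, and it presumes the secret bytes are UTF-8 text):
{code}
import java.nio.charset.StandardCharsets;

// Illustrative sketch: decode the secret bytes with an explicit charset instead
// of the platform default, so the resulting String does not depend on the JVM's
// file.encoding setting.
class SecretPrinter {
  static String toUtf8String(byte[] secretValue) {
    return new String(secretValue, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] secretValue = "some-secret".getBytes(StandardCharsets.UTF_8);
    System.out.println(toUtf8String(secretValue));
  }
}
{code}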






[jira] [Updated] (MAPREDUCE-6753) Variable in byte printed directly in mapreduce client

2017-03-02 Thread Akira Ajisaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated MAPREDUCE-6753:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0-alpha3
   2.8.1
   2.9.0
   Status: Resolved  (was: Patch Available)

Committed this to trunk, branch-2, and branch-2.8. Thanks to all who contributed 
to this issue.

> Variable in byte printed directly in mapreduce client
> -
>
> Key: MAPREDUCE-6753
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6753
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Nemo Chen
>Assignee: Kai Sasaki
>  Labels: easyfix, easytest
> Fix For: 2.9.0, 2.8.1, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6753.01.patch, MAPREDUCE-6753.02.patch, 
> MAPREDUCE-6753.03.patch
>
>
> Similar to the fix for HBASE-623, in file:
> hadoop-rel-release-2.7.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/CredentialsTestJob.java
> at line 61, System.out prints the byte array variable secretValue directly:
> {code}
> System.out.println(secretValue);
> {code}






[jira] [Commented] (MAPREDUCE-6753) Variable in byte printed directly in mapreduce client

2017-03-02 Thread Akira Ajisaka (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893835#comment-15893835
 ] 

Akira Ajisaka commented on MAPREDUCE-6753:
--

+1, thanks [~lewuathe] and [~haibochen].

> Variable in byte printed directly in mapreduce client
> -
>
> Key: MAPREDUCE-6753
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6753
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Nemo Chen
>Assignee: Kai Sasaki
>  Labels: easyfix, easytest
> Attachments: MAPREDUCE-6753.01.patch, MAPREDUCE-6753.02.patch, 
> MAPREDUCE-6753.03.patch
>
>
> Similar to the fix for HBASE-623, in file:
> hadoop-rel-release-2.7.2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/security/CredentialsTestJob.java
> at line 61, System.out prints the byte array variable secretValue directly:
> {code}
> System.out.println(secretValue);
> {code}






[jira] [Updated] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition

2017-03-02 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6852:
--
Fix Version/s: 3.0.0-alpha3

> Job#updateStatus() failed with NPE due to race condition
> 
>
> Key: MAPREDUCE-6852
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch
>
>
> Like MAPREDUCE-6762, we found this issue in a cluster where a Pig query 
> occasionally failed with an NPE - "Pig uses JobControl API to track MR job status, 
> but sometimes Job History Server failed to flush job meta files to HDFS which 
> caused the status update failed." Besides the NPE in 
> o.a.h.mapreduce.Job.getJobName, we also get an NPE in Job.updateStatus(); the 
> exception is as follows:
> {noformat}
> Caused by: java.lang.NullPointerException
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833)
>   at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320)
>   at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604)
> {noformat}
> We found that state is null here. However, we have already checked that the job 
> state is RUNNING, as in the code below:
> {noformat}
>   public boolean isComplete() throws IOException {
> ensureState(JobState.RUNNING);
> updateStatus();
> return status.isJobComplete();
>   }
> {noformat}
> The only possible explanation is that two threads are calling this at the same 
> time: both pass the state check, then one thread updates the state to null while 
> the other thread hits the NPE.
> We should fix this NPE.
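
For illustration, a minimal self-contained sketch of one way to close such a check-then-act race (an assumption for discussion, not necessarily the committed patch): perform the state check and the status refresh under the same lock that guards updates to the shared fields.
{code}
import java.io.IOException;

// Sketch only: if the check and the act happen under the same lock that other
// threads take to mutate 'state', the state observed by ensureState() cannot
// become null before updateStatus() runs.
class JobStatusSketch {
  enum JobState { DEFINE, RUNNING }

  private JobState state = JobState.RUNNING;
  private boolean complete;

  private synchronized void ensureState(JobState required) throws IOException {
    if (state != required) {
      throw new IOException("Job is in state " + state + " instead of " + required);
    }
  }

  private synchronized void updateStatus() {
    // In the real Job this refreshes the status from the cluster; here it just
    // flips a flag so the sketch compiles and runs on its own.
    complete = true;
  }

  public synchronized boolean isComplete() throws IOException {
    ensureState(JobState.RUNNING);   // check ...
    updateStatus();                  // ... and act, atomically
    return complete;
  }
}
{code}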






[jira] [Commented] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition

2017-03-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892827#comment-15892827
 ] 

Junping Du commented on MAPREDUCE-6852:
---

Thanks Jian for review and commit!

> Job#updateStatus() failed with NPE due to race condition
> 
>
> Key: MAPREDUCE-6852
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch
>
>
> Like MAPREDUCE-6762, we found this issue in a cluster where a Pig query 
> occasionally failed with an NPE - "Pig uses JobControl API to track MR job status, 
> but sometimes Job History Server failed to flush job meta files to HDFS which 
> caused the status update failed." Besides the NPE in 
> o.a.h.mapreduce.Job.getJobName, we also get an NPE in Job.updateStatus(); the 
> exception is as follows:
> {noformat}
> Caused by: java.lang.NullPointerException
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833)
>   at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320)
>   at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604)
> {noformat}
> We found that state is null here. However, we have already checked that the job 
> state is RUNNING, as in the code below:
> {noformat}
>   public boolean isComplete() throws IOException {
> ensureState(JobState.RUNNING);
> updateStatus();
> return status.isJobComplete();
>   }
> {noformat}
> The only possible explanation is that two threads are calling this at the same 
> time: both pass the state check, then one thread updates the state to null while 
> the other thread hits the NPE.
> We should fix this NPE.






[jira] [Commented] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition

2017-03-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892808#comment-15892808
 ] 

Hudson commented on MAPREDUCE-6852:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11331 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/11331/])
MAPREDUCE-6852. Job#updateStatus() failed with NPE due to race (jianhe: rev 
747bafaf969857b66233a8b4660590bdd712ed7d)
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/Job.java


> Job#updateStatus() failed with NPE due to race condition
> 
>
> Key: MAPREDUCE-6852
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
> Fix For: 2.9.0
>
> Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch
>
>
> Like MAPREDUCE-6762, we found this issue in a cluster where a Pig query 
> occasionally failed with an NPE - "Pig uses JobControl API to track MR job status, 
> but sometimes Job History Server failed to flush job meta files to HDFS which 
> caused the status update failed." Besides the NPE in 
> o.a.h.mapreduce.Job.getJobName, we also get an NPE in Job.updateStatus(); the 
> exception is as follows:
> {noformat}
> Caused by: java.lang.NullPointerException
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833)
>   at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320)
>   at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604)
> {noformat}
> We found that state is null here. However, we have already checked that the job 
> state is RUNNING, as in the code below:
> {noformat}
>   public boolean isComplete() throws IOException {
> ensureState(JobState.RUNNING);
> updateStatus();
> return status.isJobComplete();
>   }
> {noformat}
> The only possible explanation is that two threads are calling this at the same 
> time: both pass the state check, then one thread updates the state to null while 
> the other thread hits the NPE.
> We should fix this NPE.






[jira] [Updated] (MAPREDUCE-6852) Job#updateStatus() failed with NPE due to race condition

2017-03-02 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated MAPREDUCE-6852:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.9.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thanks, Junping!

> Job#updateStatus() failed with NPE due to race condition
> 
>
> Key: MAPREDUCE-6852
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6852
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Junping Du
>Assignee: Junping Du
> Fix For: 2.9.0
>
> Attachments: MAPREDUCE-6852.patch, MAPREDUCE-6852-v2.patch
>
>
> Like MAPREDUCE-6762, we found this issue in a cluster where a Pig query 
> occasionally failed with an NPE - "Pig uses JobControl API to track MR job status, 
> but sometimes Job History Server failed to flush job meta files to HDFS which 
> caused the status update failed." Besides the NPE in 
> o.a.h.mapreduce.Job.getJobName, we also get an NPE in Job.updateStatus(); the 
> exception is as follows:
> {noformat}
> Caused by: java.lang.NullPointerException
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)
>   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:320)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1833)
>   at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320)
>   at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:604)
> {noformat}
> We found that state is null here. However, we have already checked that the job 
> state is RUNNING, as in the code below:
> {noformat}
>   public boolean isComplete() throws IOException {
> ensureState(JobState.RUNNING);
> updateStatus();
> return status.isJobComplete();
>   }
> {noformat}
> The only possible explanation is that two threads are calling this at the same 
> time: both pass the state check, then one thread updates the state to null while 
> the other thread hits the NPE.
> We should fix this NPE.






[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892631#comment-15892631
 ] 

Haibo Chen commented on MAPREDUCE-6834:
---

bq. IMHO, we shouldn't do this, because it's not the fix for root cause. It 
looks like workaround
Sorry for the confusion. I should have been more specific: I meant to say that 
the root cause analysis looks correct to me. I agree with you that we should 
not follow the proposed fix there.


> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in the RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the 
> RMCommunicator.register method. The RM doesn't transmit these tokens again for 
> other allocation requests, so we don't have these tokens in the NMTokenCache. 
> Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.
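
To make the missing handling concrete, an illustrative sketch only (the helper class is an assumption, and later comments in this thread argue that this kind of fix is not appropriate for MapReduce, because the response only carries these tokens when containers are kept across attempts): copy the NMTokens returned at registration into the shared NMTokenCache so that the container launcher can find them.
{code}
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.NMToken;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

// Hypothetical helper, not the committed fix: cache NMTokens handed back by the
// RM at registration so ContainerManagementProtocolProxy can resolve them later.
class NMTokenRecovery {
  static void cacheTokensFromPreviousAttempts(RegisterApplicationMasterResponse response) {
    for (NMToken nmToken : response.getNMTokensFromPreviousAttempts()) {
      NMTokenCache.getSingleton().setToken(
          nmToken.getNodeId().toString(), nmToken.getToken());
    }
  }
}
{code}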






[jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Aleksandr Balitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892455#comment-15892455
 ] 

Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:59 PM:
---

[~jlowe], I'm not sure that YARN-3112 is the same issue as the one I reported. 

From YARN-3112 description:
{code}
New AM has inherited the old tokens from previous AM according to my 
configuration (keepContainers=true), so the token for new containers are 
replaced by the old one in the NMTokenCache.
{code}

I have not used the "keep-containers-across-application-attempts" feature, and I 
definitely did not face the problem of the token for new containers being 
replaced by the old one in the NMTokenCache (my debugging can confirm it), because 
the AM was restarted and all old tokens were removed. 




was (Author: abalitsky1):
[~jlowe], I'm not sure that YARN-3112 the same issue that I reported. 

From YARN-3112 description:
{code}
New AM has inherited the old tokens from previous AM according to my 
configuration (keepContainers=true), so the token for new containers are 
replaced by the old one in the NMTokenCache.
{code}

I have not used "keep-containers-across-application-attempts" feature, and I 
definitely didn't faced with problem when the token for new containers are 
replaced by the old one in the NMTokenCache (my debug can confirm it), cause AM 
was restarted and all old tokens was removed. 



> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in the RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the 
> RMCommunicator.register method. The RM doesn't transmit these tokens again for 
> other allocation requests, so we don't have these tokens in the NMTokenCache. 
> Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.




[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Aleksandr Balitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892455#comment-15892455
 ] 

Aleksandr Balitsky commented on MAPREDUCE-6834:
---

[~jlowe], I'm not sure that YARN-3112 is the same issue as the one I reported. 

From YARN-3112 description:
{code}
New AM has inherited the old tokens from previous AM according to my 
configuration (keepContainers=true), so the token for new containers are 
replaced by the old one in the NMTokenCache.
{code}

I have not used the "keep-containers-across-application-attempts" feature, and I 
definitely did not face the problem of the token for new containers being 
replaced by the old one in the NMTokenCache (my debugging can confirm it), because 
the AM was restarted and all old tokens were removed. 



> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in the RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the 
> RMCommunicator.register method. The RM doesn't transmit these tokens again for 
> other allocation requests, so we don't have these tokens in the NMTokenCache. 
> Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.






[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Aleksandr Balitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892442#comment-15892442
 ] 

Aleksandr Balitsky commented on MAPREDUCE-6834:
---

[~jlowe], yep, the 001 patch isn't good from a design point of view. 
I'm going to investigate the code again to find the root cause. 

> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in the RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the 
> RMCommunicator.register method. The RM doesn't transmit these tokens again for 
> other allocation requests, so we don't have these tokens in the NMTokenCache. 
> Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.






[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892416#comment-15892416
 ] 

Jason Lowe commented on MAPREDUCE-6834:
---

Ah, comment race!  ;-)

[~abalitsky1] so if I understand correctly, you're saying that the patch in 
this JIRA does _not_ fix the issue?  I'm trying to resolve that with [this 
comment|https://issues.apache.org/jira/browse/MAPREDUCE-6834?focusedCommentId=15770392=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15770392].
  If you agree that the patch here isn't appropriate, then I agree we should 
just duplicate this to YARN-3112.

> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in the RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the 
> RMCommunicator.register method. The RM doesn't transmit these tokens again for 
> other allocation requests, so we don't have these tokens in the NMTokenCache. 
> Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.






[jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Aleksandr Balitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892400#comment-15892400
 ] 

Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:27 PM:
---

Hi [~haibochen], [~jlowe]
Sorry for the late reply. 

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve 
containers across app attempts? I ask because ApplicationMasterService normally 
does not call setNMTokensFromPreviousAttempts on 
RegisterApplicationMasterResponse unless 
getKeepContainersAcrossApplicationAttempts on the application submission 
context is true. Last I checked the MapReduce client (YARNRunner) wasn't 
specifying that when the application is submitted to YARN.
{quote}

Actually, you are right. I did not consider that MR doesn't support AM 
work-preserving restart, and I now see that my first patch isn't a good 
solution for this problem. Thanks for the review!

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running the Fair Scheduler. I don't think that this issue depends on the 
scheduler, but I will check it with the other schedulers. 

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more 
details, I came to a similar conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
 MR relies on YARN RM to get the NMtokens needed to launch containers with NMs. 
Given the code today, it is possible that a null NMToken is sent to MR, which 
contracts with the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to preserve containers 
in MR. But the solution that you mentioned contradicts YARN design:
{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters 
for each and every allocated container, but only for the first time or if 
NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

It is true that a null NMToken can be sent to MR. NMTokens are sent only after 
first creation; that is a designed feature. They are then saved to the NMTokenCache 
on the AM side. It is not necessary to pass NM tokens during each allocation 
interaction. So it is not the best decision to clear the NMTokenSecretManager cache 
during each allocation, because it disables the "cache" feature and new NM tokens 
will be generated (instead of using the instance from the cache) during each 
allocation response. IMHO, we shouldn't do this, because it's not a fix for the 
root cause. It looks like a workaround. 


was (Author: abalitsky1):
Hi [~haibochen], [~jlowe]
Sorry for late reply. 

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve 
containers across app attempts? I ask because ApplicationMasterService normally 
does not call setNMTokensFromPreviousAttempts on 
RegisterApplicationMasterResponse unless 
getKeepContainersAcrossApplicationAttempts on the application submission 
context is true. Last I checked the MapReduce client (YARNRunner) wasn't 
specifying that when the application is submitted to YARN.
{quote}

Actually you are right. I did not consider that MR doesn't support AM 
work-preserving restart and currently I see that my first patch isn't good 
solution for this problem. Thanks for the review!

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running Fair Scheduler. I don't think that this issue depends on a 
scheduler, but I will check it with another schedulers. 

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more 
details, I came to a similar conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
 MR relies on YARN RM to get the NMtokens needed to launch containers with NMs. 
Given the code today, it is possible that a null NMToken is sent to MR, which 
contracts with the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to reserve containers in 
MR. But the solution that you mentioned contradicts YARN design:
{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters 
for each and every allocated container, but only for the first time or if 
NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

That's so true, it is possible that a null NMToken is sent to MR. NMTokens 
sends only after first creation, it's designed feature. Then it saves to 
NMTokenCache from AM side. It's not necessary to pass NM tokens during each 
allocation interaction. So, it's not the best decision to clear 
NMTokenSecretManager cache during each allocation, because it disables "cache" 
feature and new NM 

[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892407#comment-15892407
 ] 

Jason Lowe commented on MAPREDUCE-6834:
---

Based the patch which claims to fix the problem, I would argue it is not a 
duplicate.  The patch is all about transferring containers from the previous 
attempt in the registration response, but that is not filled in unless the 
application was submitted with preservation of containers across application 
attempts.  MapReduce does not do this, therefore I don't see how this patch 
helps the problem unless MapReduce was patched to do so.  I agree the symptom 
is similar to YARN-3112, but I doubt a fix for YARN-3112 will address 
[~abalitsky1]'s issue if the original patch in this JIRA also corrected it.

> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> *Problem*:
> When RMCommunicator sends a "registerApplicationMaster" request to the RM, the RM 
> generates NMTokens for the new RMAppAttempt. Those new NMTokens are transmitted 
> to RMCommunicator in the RegisterApplicationMasterResponse 
> (getNMTokensFromPreviousAttempts method), but we don't handle these tokens in the 
> RMCommunicator.register method. The RM doesn't transmit these tokens again for 
> other allocation requests, so we don't have these tokens in the NMTokenCache. 
> Accordingly, we get the "No NMToken sent for node" exception.
> I have found that this issue appears after the changes from 
> https://github.com/apache/hadoop/commit/9b272ccae78918e7d756d84920a9322187d61eed
> I tried the same scenario without that commit, and the application completed 
> successfully after MRAppMaster recovery.






[jira] [Comment Edited] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Aleksandr Balitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892400#comment-15892400
 ] 

Aleksandr Balitsky edited comment on MAPREDUCE-6834 at 3/2/17 3:20 PM:
---

Hi [~haibochen], [~jlowe]
Sorry for the late reply. 

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve 
containers across app attempts? I ask because ApplicationMasterService normally 
does not call setNMTokensFromPreviousAttempts on 
RegisterApplicationMasterResponse unless 
getKeepContainersAcrossApplicationAttempts on the application submission 
context is true. Last I checked the MapReduce client (YARNRunner) wasn't 
specifying that when the application is submitted to YARN.
{quote}

Actually, you are right. I did not consider that MR doesn't support AM 
work-preserving restart, and I now see that my first patch isn't a good 
solution for this problem. Thanks for the review!

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running the Fair Scheduler. I don't think that this issue depends on the 
scheduler, but I will check it with the other schedulers. 

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more 
details, I came to a similar conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
 MR relies on YARN RM to get the NMtokens needed to launch containers with NMs. 
Given the code today, it is possible that a null NMToken is sent to MR, which 
contracts with the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to reserve containers in 
MR. But the solution that you mentioned contradicts YARN design:
{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters 
for each and every allocated container, but only for the first time or if 
NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

It is true that a null NMToken can be sent to MR. NMTokens are sent only after 
first creation; that is a designed feature. They are then saved to the NMTokenCache 
on the AM side. It is not necessary to pass NM tokens during each allocation 
interaction. So it is not the best decision to clear the NMTokenSecretManager cache 
during each allocation, because it disables the "cache" feature and new NM tokens 
will be generated (instead of using the instance from the cache) during each 
allocation response. IMHO, we shouldn't do this, because it's not a fix for the 
root cause. It looks like a workaround. 


was (Author: abalitsky1):
Hi [~haibochen], [~jlowe]
Sorry for late reply. 

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve 
containers across app attempts? I ask because ApplicationMasterService normally 
does not call setNMTokensFromPreviousAttempts on 
RegisterApplicationMasterResponse unless 
getKeepContainersAcrossApplicationAttempts on the application submission 
context is true. Last I checked the MapReduce client (YARNRunner) wasn't 
specifying that when the application is submitted to YARN.
{quote}

Actually you are right. I did not consider that MR doesn't support AM 
work-preserving restart and currently I see that my first patch isn't good 
solution for this problem. Thanks for the review!

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running Fair Scheduler. I don't think that this issue depends on a 
scheduler, but I will check it with another schedulers. 

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more 
details, I came to a similar conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
 MR relies on YARN RM to get the NMtokens needed to launch containers with NMs. 
Given the code today, it is possible that a null NMToken is sent to MR, which 
contracts with the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to reserve containers in 
MR. But the solution that you mentioned contradicts YARN design:
{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters 
for each and every allocated container, but only for the first time or if 
NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

That's so true that it is possible that a null NMToken is sent to MR. NMTokens 
sends only after first creation, it's designed feature. Then it saves to 
NMTokenCache from AM side. It's not necessary to pass NM tokens during each 
allocation interaction. So, it's not the best decision to clear 
NMTokenSecretManager cache during each allocation, because it disables "cache" 
feature and new NM 

[jira] [Commented] (MAPREDUCE-6834) MR application fails with "No NMToken sent" exception after MRAppMaster recovery

2017-03-02 Thread Aleksandr Balitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892400#comment-15892400
 ] 

Aleksandr Balitsky commented on MAPREDUCE-6834:
---

Hi [~haibochen], [~jlowe]
Sorry for the late reply. 

{quote}
Is this a scenario where somehow the MRAppMaster is asking to preserve 
containers across app attempts? I ask because ApplicationMasterService normally 
does not call setNMTokensFromPreviousAttempts on 
RegisterApplicationMasterResponse unless 
getKeepContainersAcrossApplicationAttempts on the application submission 
context is true. Last I checked the MapReduce client (YARNRunner) wasn't 
specifying that when the application is submitted to YARN.
{quote}

Actually, you are right. I did not consider that MR doesn't support AM 
work-preserving restart, and I now see that my first patch isn't a good 
solution for this problem. Thanks for the review!

{quote}
Aleksandr Balitsky, which scheduler were you running?
{quote}

I'm running the Fair Scheduler. I don't think that this issue depends on the 
scheduler, but I will check it with the other schedulers. 

{quote}
We have not made changes to preserve containers in MR. Chasing the code in more 
details, I came to a similar conclusion as 
https://issues.apache.org/jira/browse/YARN-3112?focusedCommentId=14299003=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299003
 MR relies on YARN RM to get the NMtokens needed to launch containers with NMs. 
Given the code today, it is possible that a null NMToken is sent to MR, which 
contracts with the javadoc in SchedulerApplicationAttempt.java here
{quote}

I totally agree with you that we have not made changes to reserve containers in 
MR. But the solution that you mentioned contradicts YARN design:
{quote}
As for network optimization, NMTokens are not sent to the ApplicationMasters 
for each and every allocated container, but only for the first time or if 
NMTokens have to be invalidated due to the rollover of the underlying master key
{quote}

It is true that a null NMToken can be sent to MR. NMTokens are sent only after 
first creation; that is a designed feature. They are then saved to the NMTokenCache 
on the AM side. It is not necessary to pass NM tokens during each allocation 
interaction. So it is not the best decision to clear the NMTokenSecretManager cache 
during each allocation, because it disables the "cache" feature and new NM tokens 
will be generated (instead of using the instance from the cache) during each 
allocation response. IMHO, we shouldn't do this, because it's not a fix for the 
root cause. It looks like a workaround. 

> MR application fails with "No NMToken sent" exception after MRAppMaster 
> recovery
> 
>
> Key: MAPREDUCE-6834
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6834
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: resourcemanager, yarn
>Affects Versions: 2.7.0
> Environment: Centos 7
>Reporter: Aleksandr Balitsky
>Assignee: Aleksandr Balitsky
>Priority: Critical
> Attachments: YARN-6019.001.patch
>
>
> *Steps to reproduce:*
> 1) Submit MR application (for example PI app with 50 containers)
> 2) Find MRAppMaster process id for the application 
> 3) Kill MRAppMaster by kill -9 command
> *Expected:* ResourceManager launches a new MRAppMaster container and MRAppAttempt, 
> and the application finishes correctly.
> *Actual:* After launching the new MRAppMaster and MRAppAttempt, the application 
> fails with the following exception:
> {noformat}
> 2016-12-22 23:17:53,929 ERROR [ContainerLauncher #9] 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container 
> launch failed for container_1482408247195_0002_02_11 : 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent 
> for node1:43037
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:254)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.(ContainerManagementProtocolProxy.java:244)
>   at 
> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:129)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:395)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>   at 
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:361)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes an object name

2017-03-02 Thread Gil Vernik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Vernik updated MAPREDUCE-6854:
--
Description: 
Consider an example: a local file "/data/a.txt" needs to be copied into 
swift://container.service/data/a.txt

The way distcp works is that first it will upload "/data/a.txt" into 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion distcp will move   
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID
and then rename them to the final names. Such a flow is problematic in 
object stores, where it is usually advised not to create, delete, and re-create an 
object under the same name. 

This JIRA proposes to add a configuration key indicating that temporary objects 
will also include the object name as part of their temporary file name.

For example
"/data/a.txt" will be uploaded into 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"

  was:
Consider an example: a local file "/data/a.txt"  need to be copied into 
swift://container.service/data/a.txt

The way distcp works is that first it will upload "/data/a.txt" into 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion distcp will move   
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID
and then rename them to the final names.  Such flow is problematic in the 
object stores, where it usually advised not to create, delete and create object 
under the same name. 

This JIRA propose to add a configuration key indicating that temporary objects 
will also include object name as part of their temporary file name,

For example
"/data/a.txt" will be uploaded into 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"


> Each map task should create a unique temporary name that includes an object 
> name
> 
>
> Key: MAPREDUCE-6854
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Reporter: Gil Vernik
>
> Consider an example: a local file "/data/a.txt" needs to be copied into 
> swift://container.service/data/a.txt
> The way distcp works is that first it will upload "/data/a.txt" into 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
> Upon completion distcp will move   
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
>  into swift://container.mil01/data/a.txt
> 
> The temporary file naming convention assumes that each map task will 
> sequentially create objects as swift://container.mil01/.distcp.tmp.attempt_ID
> and then rename them to the final names. Such a flow is problematic in 
> object stores, where it is usually advised not to create, delete, and re-create 
> an object under the same name. 
> This JIRA proposes to add a configuration key indicating that temporary 
> objects will also include the object name as part of their temporary file name.
> For example
> "/data/a.txt" will be uploaded into 
> "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt"
>  or 
> "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes an object name

2017-03-02 Thread Gil Vernik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Vernik updated MAPREDUCE-6854:
--
Summary: Each map task should create a unique temporary name that includes 
an object name  (was: Each map task should create a unique temporary name that 
includes object name)

> Each map task should create a unique temporary name that includes an object 
> name
> 
>
> Key: MAPREDUCE-6854
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Reporter: Gil Vernik
>
> Consider an example: a local file "/data/a.txt" needs to be copied to 
> swift://container.service/data/a.txt
> The way distcp works is that it first uploads "/data/a.txt" to 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
> Upon completion, distcp moves 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
>  into swift://container.mil01/data/a.txt
> The temporary file naming convention assumes that each map task will 
> sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
> and then rename them to their final names. Such a flow is problematic in 
> object stores, where it is usually advised not to create, delete, and 
> re-create an object under the same name.
> This JIRA proposes to add a configuration key indicating that temporary 
> objects should also include the object name as part of their temporary file 
> name.
> For example, "/data/a.txt" would be uploaded to 
> "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt"
>  or 
> "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes object name

2017-03-02 Thread Gil Vernik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Vernik updated MAPREDUCE-6854:
--
Description: 
Consider an example: a local file "/data/a.txt" needs to be copied to 
swift://container.service/data/a.txt

The way distcp works is that it first uploads "/data/a.txt" to 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion, distcp moves 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
and then rename them to their final names. Such a flow is problematic in object 
stores, where it is usually advised not to create, delete, and re-create an 
object under the same name.

This JIRA proposes to add a configuration key indicating that temporary objects 
should also include the object name as part of their temporary file name.

For example, "/data/a.txt" would be uploaded to 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"

  was:
Consider an example: a local file "/data/a.txt" needs to be copied to 
swift://container.service/data/a.txt

The way distcp works is that it first uploads "/data/a.txt" to 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion, distcp moves 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
and then rename them to their final names. Such a flow is problematic in object 
stores, where it is usually advised not to create, delete, and re-create an 
object under the same name.

This JIRA proposes to add a configuration key indicating that temporary objects 
should also include the object name as part of their temporary file name.

For example, "/data/a.txt" would be uploaded to 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"


> Each map task should create a unique temporary name that includes object name
> -
>
> Key: MAPREDUCE-6854
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Reporter: Gil Vernik
>
> Consider an example: a local file "/data/a.txt" needs to be copied to 
> swift://container.service/data/a.txt
> The way distcp works is that it first uploads "/data/a.txt" to 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
> Upon completion, distcp moves 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
>  into swift://container.mil01/data/a.txt
> The temporary file naming convention assumes that each map task will 
> sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
> and then rename them to their final names. Such a flow is problematic in 
> object stores, where it is usually advised not to create, delete, and 
> re-create an object under the same name.
> This JIRA proposes to add a configuration key indicating that temporary 
> objects should also include the object name as part of their temporary file 
> name.
> For example, "/data/a.txt" would be uploaded to 
> "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0/a.txt"
>  or 
> "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes object name

2017-03-02 Thread Gil Vernik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gil Vernik updated MAPREDUCE-6854:
--
Description: 
Consider an example: a local file "/data/a.txt" needs to be copied to 
swift://container.service/data/a.txt

The way distcp works is that it first uploads "/data/a.txt" to 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion, distcp moves 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
and then rename them to their final names. Such a flow is problematic in object 
stores, where it is usually advised not to create, delete, and re-create an 
object under the same name.

This JIRA proposes to add a configuration key indicating that temporary objects 
should also include the object name as part of their temporary file name.

For example, "/data/a.txt" would be uploaded to 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"

  was:
Consider an example: a local file "/data/a.txt" needs to be copied to 
swift://container.service/data/a.txt

The way distcp works is that it first uploads "/data/a.txt" to 
swift://container.mil01/data3/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion, distcp moves 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
and then rename them to their final names. Such a flow is problematic in object 
stores, where it is usually advised not to create, delete, and re-create an 
object under the same name.

This JIRA proposes to add a configuration key indicating that temporary objects 
should also include the object name as part of their temporary file name.

For example, "/data/a.txt" would be uploaded to 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"


> Each map task should create a unique temporary name that includes object name
> -
>
> Key: MAPREDUCE-6854
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Reporter: Gil Vernik
>
> Consider an example: a local file "/data/a.txt" needs to be copied to 
> swift://container.service/data/a.txt
> The way distcp works is that it first uploads "/data/a.txt" to 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
> Upon completion, distcp moves 
> swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
>  into swift://container.mil01/data/a.txt
> The temporary file naming convention assumes that each map task will 
> sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
> and then rename them to their final names. Such a flow is problematic in 
> object stores, where it is usually advised not to create, delete, and 
> re-create an object under the same name.
> This JIRA proposes to add a configuration key indicating that temporary 
> objects should also include the object name as part of their temporary file 
> name.
> For example, "/data/a.txt" would be uploaded to 
> "swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt"
>  or 
> "swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Created] (MAPREDUCE-6854) Each map task should create a unique temporary name that includes object name

2017-03-02 Thread Gil Vernik (JIRA)
Gil Vernik created MAPREDUCE-6854:
-

 Summary: Each map task should create a unique temporary name that 
includes object name
 Key: MAPREDUCE-6854
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6854
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp
Reporter: Gil Vernik


Consider an example: a local file "/data/a.txt" needs to be copied to 
swift://container.service/data/a.txt

The way distcp works is that it first uploads "/data/a.txt" to 
swift://container.mil01/data3/.distcp.tmp.attempt_local2036034928_0001_m_00_0

Upon completion, distcp moves 
swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0
 into swift://container.mil01/data/a.txt

The temporary file naming convention assumes that each map task will 
sequentially create objects named swift://container.mil01/.distcp.tmp.attempt_ID 
and then rename them to their final names. Such a flow is problematic in object 
stores, where it is usually advised not to create, delete, and re-create an 
object under the same name.

This JIRA proposes to add a configuration key indicating that temporary objects 
should also include the object name as part of their temporary file name.

For example, "/data/a.txt" would be uploaded to 
"swift://container.mil01/data/.distcp.tmp.attempt_local2036034928_0001_m_00_0"/a.txt"
 or 
"swift://container.mil01/data/a.txt/.distcp.tmp.attempt_local2036034928_0001_m_00_0"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org