[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627669#comment-13627669
 ] 

Hudson commented on YARN-112:
-

Integrated in Hadoop-Yarn-trunk #179 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/179/])
YARN-112. Fixed a race condition during localization that fails containers. 
Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by 
Omkar Vinit Joshi. (Revision 1466196)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalDistributedCacheManager.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTracker.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java


 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Fix For: 2.0.5-beta

 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112-20130408.1.patch, yarn-112-20130408.patch, 
 yarn-112-20130409.patch, yarn-112.20131503.patch


 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the containers failed when it couldn't rename the 
 temporary job.jar directory to its final name because the target directory 
 wasn't empty.  Shortly afterwards the second container failed because job.xml 
 could not be found, presumably because the first container removed it when it 
 cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627758#comment-13627758
 ] 

Hudson commented on YARN-112:
-

Integrated in Hadoop-Hdfs-trunk #1368 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1368/])
YARN-112. Fixed a race condition during localization that fails containers. 
Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by 
Omkar Vinit Joshi. (Revision 1466196)

 Result = FAILURE
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627813#comment-13627813
 ] 

Hudson commented on YARN-112:
-

Integrated in Hadoop-Mapreduce-trunk #1395 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1395/])
YARN-112. Fixed a race condition during localization that fails containers. 
Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by 
Omkar Vinit Joshi. (Revision 1466196)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626412#comment-13626412
 ] 

Hadoop QA commented on YARN-112:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577749/yarn-112-20130409.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/693//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/693//console

This message is automatically generated.



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-09 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626999#comment-13626999
 ] 

Vinod Kumar Vavilapalli commented on YARN-112:
--

The patch passes eclipse:eclipse on my laptop; I believe the failure is due to 
an unclean .m2 on the build machines (HADOOP-9251).

Checking this in.



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627015#comment-13627015
 ] 

Hudson commented on YARN-112:
-

Integrated in Hadoop-trunk-Commit #3584 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3584/])
YARN-112. Fixed a race condition during localization that fails containers. 
Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by 
Omkar Vinit Joshi. (Revision 1466196)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-08 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625829#comment-13625829
 ] 

Omkar Vinit Joshi commented on YARN-112:


I am rebasing the patch now that YARN-467 is committed and YARN-99 is updated.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112-20130408.patch, yarn-112.20131503.patch




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-08 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625936#comment-13625936
 ] 

Vinod Kumar Vavilapalli commented on YARN-112:
--

This patch looks so much better!

Two nits:
 - FSDownload: Remove the commented-out code completely.
 - TestFSDownload: Rename testRaceCondForFSDownload() to something like 
testUniqueDestinationPath(). Having "race condition" in the name isn't 
helping.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112-20130408.patch, yarn-112.20131503.patch




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-04-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626117#comment-13626117
 ] 

Hadoop QA commented on YARN-112:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12577672/yarn-112-20130408.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  org.apache.hadoop.mapred.TestMRWithDistributedCache

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/689//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/689//console

This message is automatically generated.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Fix For: 2.0.5-beta

 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112-20130408.1.patch, yarn-112-20130408.patch, 
 yarn-112.20131503.patch




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-29 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617367#comment-13617367
 ] 

Robert Joseph Evans commented on YARN-112:
--

Vinod,

I just glanced at the latest patch; I did not read it in detail, so if you say 
it covers that case, I trust you.

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616325#comment-13616325
 ] 

Robert Joseph Evans commented on YARN-112:
--

I agree that scale exposes races, but the underlying problem is still that we 
want to create a new unique directory.  This seems very simple.

{code}
File uniqueDir = null;
do {
  uniqueDir = new File(baseDir, String.valueOf(rand.nextLong()));
} while (!uniqueDir.mkdir());
{code}

I don't see why we are going through all of this complexity simply because a 
FileContext API is broken.  Playing games to make the race less likely is 
fine, but ultimately we still have to handle the race.
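
For illustration, the loop above can be fleshed out into a self-contained 
sketch.  This is only a sketch under stated assumptions, not the code in any of 
the attached patches: the class name UniqueDirs, the method createUniqueDir and 
the maxAttempts bound are hypothetical.  The bound is there because 
File.mkdir() also returns false when the base directory is missing or not 
writable, in which case an unbounded loop would never terminate.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.security.SecureRandom;

// Hypothetical helper for illustration; not part of Hadoop or of the YARN-112 patch.
final class UniqueDirs {
  private static final SecureRandom RAND = new SecureRandom();

  private UniqueDirs() {}

  /**
   * Creates a directory under baseDir that did not exist before this call.
   * Gives up with an IOException after maxAttempts failed mkdir() calls.
   */
  static File createUniqueDir(File baseDir, int maxAttempts) throws IOException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      // Clear the sign bit so the directory name never starts with '-'.
      long id = RAND.nextLong() & Long.MAX_VALUE;
      File candidate = new File(baseDir, Long.toString(id));
      // mkdir() returns true only if this call created the directory, so a
      // concurrent creator of the same name makes us retry with a new name.
      if (candidate.mkdir()) {
        return candidate;
      }
    }
    throw new IOException("Could not create a unique directory under "
        + baseDir + " after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) throws IOException {
    File base = Files.createTempDirectory("yarn112-sketch").toFile();
    System.out.println(createUniqueDir(base, 10));
    System.out.println(createUniqueDir(base, 10));
  }
}
```

Because the existence check and the creation happen in the single mkdir() call, 
the loser of a name collision simply picks another name instead of failing.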

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616327#comment-13616327
 ] 

Robert Joseph Evans commented on YARN-112:
--

Oh, and the latest patch using a unique number will not always work, because 
the same code is used from different processes on the same box.  We would have 
to have a way to guarantee uniqueness between the different processes.  
System.currentTimeMillis() helps but could still result in a race.
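
To make the concern concrete, here is a minimal sketch of how a per-process 
number source can collide.  The class PerProcessCounter is hypothetical and 
only stands in for "a unique number generated independently inside each 
process"; two instances play the role of two processes that happened to start 
in the same millisecond.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-process "unique number" source.
final class PerProcessCounter {
  private final AtomicLong next;

  PerProcessCounter(long seed) {
    this.next = new AtomicLong(seed);
  }

  /** Unique within this instance, but says nothing about other instances. */
  String nextDirName() {
    return Long.toString(next.getAndIncrement());
  }

  public static void main(String[] args) {
    // Two "processes" seeded from the clock in the same millisecond.
    long seed = System.currentTimeMillis();
    PerProcessCounter processA = new PerProcessCounter(seed);
    PerProcessCounter processB = new PerProcessCounter(seed);
    String fromA = processA.nextDirName();
    String fromB = processB.nextDirName();
    System.out.println("collision: " + fromA.equals(fromB));
  }
}
```

Each counter never repeats a value itself, yet both hand out the same first 
name, which is the cross-process collision described above.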

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
 Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
 yarn-112-20130326.patch, yarn-112.20131503.patch




[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-28 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616447#comment-13616447
 ] 

Vinod Kumar Vavilapalli commented on YARN-112:
--

bq. Playing games to make the race less likely is fine. But ultimately we still 
have to handle the race.
bq. Oh and the latest patch using a unique number will not always work, because 
the same code is used from different processes on the same box.
Bobby, the unique number generation is done in one single process and 
communicated down. ResourceTrackerService (NodeManager process) generates the 
unique path and passes it down to FSDownload (Localizer process), so we can 
avoid the race altogether.



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-26 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614402#comment-13614402
 ] 

Robert Joseph Evans commented on YARN-112:
--

I am not really sure that we fixed the underlying issue.  

{code}files.rename(dst_work, destDirPath, Rename.OVERWRITE);{code}

threw an exception because there was something else in that directory already, 
but files.mkdir(destDirPath, cachePerms, false) is supposed to throw a 
FileAlreadyExistsException if the directory already exists.  

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html#mkdir%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.permission.FsPermission,%20boolean%29

files.rename should never get into this situation if files.mkdir threw the 
exception when it was supposed to.

I tested this and 
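{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
FileContext lfc = FileContext.getLocalFSFileContext(new Configuration());
Path p = new Path("/tmp/bobby.12345");
FsPermission cachePerms = new FsPermission((short) 0755);
// createParent=false: the second call is expected to throw FileAlreadyExistsException
lfc.mkdir(p, cachePerms, false);
lfc.mkdir(p, cachePerms, false);
{code}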

never throws an exception.  We first need to address the bug in FileContext, 
and then we can look at how we can make FSDownload deal with mkdir throwing an 
exception, or whatever the fix ends up being.
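
For comparison, here is a minimal sketch of the create-or-fail contract 
described above. It is written against java.nio.file rather than Hadoop's 
FileContext (an assumption, made only so the example runs without Hadoop), and 
the class and method names are hypothetical. Files.createDirectory is 
documented to throw FileAlreadyExistsException when the directory already 
exists, which is the behaviour this comment expects from mkdir with 
createParent=false.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration class; not part of Hadoop.
final class MkdirContractSketch {

    /** Creates the same directory twice and reports whether the second attempt was rejected. */
    static boolean secondCreateIsRejected() {
        try {
            Path parent = Files.createTempDirectory("mkdir-sketch");
            Path p = parent.resolve("bobby.12345");
            try {
                Files.createDirectory(p); // first call creates the directory
                Files.createDirectory(p); // second call should fail: the directory exists
                return false;
            } catch (FileAlreadyExistsException expected) {
                return true;
            } finally {
                // Best-effort cleanup of the scratch directories.
                Files.deleteIfExists(p);
                Files.deleteIfExists(parent);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second create rejected: " + secondCreateIsRejected());
    }
}
```

A caller that relies on this contract can treat the exception as "someone else 
already owns this name" and pick a different one.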

I filed HADOOP-9438 for this.

If the fix ends up being that we do not support throwing the exception in 
FileContext, then your current solution looks OK.

I also have a hard time believing that we are getting random collisions on a 
long value that should be fairly uniformly distributed.  We need to guard 
against it either way and I suppose it is possible, but if I remember correctly 
we were seeing a significant number of these errors and my gut tells me that 
there is either something very wrong with Random, or there is something else 
also going on here.
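
To put a rough number on that intuition, the birthday approximation 
p ~= 1 - exp(-n(n-1) / (2 * 2^64)) estimates the chance that n independent, 
uniformly distributed 64-bit values contain at least one duplicate. The sketch 
below assumes ideal uniform values and uses a hypothetical class name; it says 
nothing about how the values are actually generated on a node.

```java
// Hypothetical helper for a back-of-the-envelope estimate; not part of Hadoop.
final class CollisionOdds {

    /**
     * Birthday-bound approximation of the probability that n uniformly random
     * values drawn from a space of 2^bits contain at least one duplicate.
     */
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double exponent = -((double) n * (double) (n - 1)) / (2.0 * space);
        // 1 - e^x, computed with expm1 to stay accurate for tiny probabilities.
        return -Math.expm1(exponent);
    }

    public static void main(String[] args) {
        // One million random longs drawn over the lifetime of a cache directory.
        System.out.println(collisionProbability(1_000_000L, 64));
    }
}
```

Under these assumptions even a million draws give a probability on the order 
of 1e-8, so frequent collisions would point at the values not being 
independent and uniform rather than at bad luck.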



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-26 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614650#comment-13614650
 ] 

Vinod Kumar Vavilapalli commented on YARN-112:
--

Bobby, I too have seen this on large clusters/jobs - the law of large numbers 
:) We don't see the random number generator.

HADOOP-9438 will help, but I think instead of this solution, avoiding the race 
altogether by generating a destination path that is unique by construction is 
a better solution. Something like localizer_id + random_num is a better 
destination path than a plain random number.
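
A minimal sketch of the localizer_id + random_num idea follows. The separator, 
the hex encoding, and the class and method names are assumptions made for 
illustration; the comment only specifies that the two parts are combined. The 
scheme also assumes that localizer IDs are unique on a node.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration class; not part of Hadoop.
final class LocalizerPathSketch {

    /** Builds a directory name that differs between localizers even if they draw the same number. */
    static String destinationName(String localizerId, long randomNum) {
        // Hex avoids a leading '-' for negative longs (an assumption, not a requirement).
        return localizerId + "_" + Long.toHexString(randomNum);
    }

    public static void main(String[] args) {
        long sameNumber = ThreadLocalRandom.current().nextLong();
        // Two localizers drawing the same number still end up with different names.
        System.out.println(destinationName("localizer_A", sameNumber));
        System.out.println(destinationName("localizer_B", sameNumber));
    }
}
```

With this naming, a collision requires the same localizer to draw the same 
number twice, which a single process can detect or avoid on its own.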



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-26 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614654#comment-13614654
 ] 

Vinod Kumar Vavilapalli commented on YARN-112:
--

bq. We don't see the random number generator.
I meant seed* .



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-26 Thread omkar vinit joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614755#comment-13614755
 ] 

omkar vinit joshi commented on YARN-112:


Vinod's suggestion looks good to me, and it will in fact simplify the 
FSDownload logic. I am adding a unique number generator (AtomicLong) to 
LocalResourcesTrackerImpl so that random (in our case now unique) number 
generation is centralized for public, private, and application cache files.
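
As a rough sketch of what a centralized AtomicLong generator provides, the 
class below hands out numbers that are distinct across threads of one process. 
The class name, the starting value, and the test harness are assumptions for 
illustration and are not taken from the patch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration class; not the code in LocalResourcesTrackerImpl.
final class UniqueNumberGeneratorSketch {

    // Starting at 0 is an assumption; any start works as long as it is never reused.
    private final AtomicLong uniqueNumber = new AtomicLong(0);

    /** Returns a value no other caller of this instance has received. */
    long next() {
        return uniqueNumber.incrementAndGet();
    }

    /** Draws numbers from several threads at once and counts the distinct results. */
    static int distinctValues(int threads, int perThread) {
        UniqueNumberGeneratorSketch generator = new UniqueNumberGeneratorSketch();
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    seen.add(generator.next());
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctValues(8, 1000) + " distinct values");
    }
}
```

The guarantee only holds inside the process that owns the counter, so the 
generated value has to be passed to any other process that needs it.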



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614929#comment-13614929
 ] 

Hadoop QA commented on YARN-112:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12575629/yarn-112-20130326.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/614//console

This message is automatically generated.



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613342#comment-13613342
 ] 

Hadoop QA commented on YARN-112:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12575428/yarn-112-20130325.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1362 javac 
compiler warnings (more than the trunk's current 1361 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/594//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/594//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/594//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/594//console

This message is automatically generated.



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613391#comment-13613391
 ] 

Hadoop QA commented on YARN-112:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12575448/yarn-112-20130325.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/596//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/596//console

This message is automatically generated.



[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2013-03-20 Thread omkar vinit joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608333#comment-13608333
 ] 

omkar vinit joshi commented on YARN-112:


This problem occurs mainly because the directory-creation call (mkdir) on 
FileContext does not throw an exception when the file system is 
RawLocalFileSystem. If the directory is already present, a new call silently 
returns instead of throwing an exception. This causes the race condition when 
two containers try to localize at the same time and get the same random 
number. However, the rename call is atomic, and we should use it to avoid the 
race condition.

Earlier implementation
1) generate a random number (r1)
2) check whether r1 is present; if present go to 1) else go to 3)
3) create directories r1 and r1_tmp
4) copy the files into r1_tmp
5) rename r1_tmp to r1 (this is an atomic call and only one thread will 
succeed; the rest will fail. The error listed is just one of the errors that 
might be logged).
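
The failure in step 5 can be reproduced in miniature with java.nio.file, used 
here as a stand-in for FileContext.rename with Rename.OVERWRITE (an 
assumption; the two APIs are not identical). The sketch is simplified in two 
ways that are also assumptions: each worker gets its own temporary directory, 
and the two renames run one after the other instead of concurrently. All names 
are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration class; not part of Hadoop.
final class RenameCollisionSketch {

    /** Moves a filled temporary directory onto the final name; false if the target is non-empty. */
    static boolean publish(Path tmpDir, Path finalDir) {
        try {
            // Replacing an empty directory is allowed; a non-empty one is refused.
            Files.move(tmpDir, finalDir, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Two workers that picked the same r1 both try to publish their download. */
    static boolean[] simulate() {
        try {
            Path base = Files.createTempDirectory("filecache-sketch");
            Path finalDir = Files.createDirectory(base.resolve("r1"));
            Path tmpA = Files.createDirectory(base.resolve("r1_tmpA"));
            Path tmpB = Files.createDirectory(base.resolve("r1_tmpB"));
            Files.write(tmpA.resolve("job.jar"), new byte[] {1});
            Files.write(tmpB.resolve("job.jar"), new byte[] {2});
            boolean first = publish(tmpA, finalDir);  // target is still empty
            boolean second = publish(tmpB, finalDir); // target now holds job.jar
            return new boolean[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] outcome = simulate();
        System.out.println("first=" + outcome[0] + ", second=" + outcome[1]);
    }
}
```

The second worker is refused because the destination is no longer empty, 
which matches the "Rename cannot overwrite non empty destination directory" 
error reported for this issue.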


Suggested Fix
1) generate a random number (r1)
2) check whether r1 is present; if present go to 1) else go to 3)
3) create dir r1
4) rename r1 to r1_tmp (only one thread will succeed; the rest will get an 
exception and go back to 1)
5) check whether any file exists inside r1_tmp; if so, rename it back to r1 
and go to 1), else go to 6). (This check is needed because two threads can get 
the same random number and both pass check 2. If one thread then completely 
finishes its download, it renames r1_tmp back to r1, so the rename call (r1 to 
r1_tmp) of the other thread, which only now comes into the picture, will 
succeed. This must be avoided, and checking the contents of r1_tmp avoids it.)
6) create r1
7) continue with the actual file download.
8) rename r1_tmp to r1.
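
The common idea in these steps is to claim a name with an operation that only 
one contender can win and to retry with a new number on failure. The sketch 
below shows that loop with a different claiming operation than the one 
proposed above: it uses java.nio.file's Files.createDirectory, which is 
documented to be atomic and to fail if the name is taken, instead of the 
r1 to r1_tmp rename. That substitution and all names are assumptions for 
illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.LongSupplier;

// Hypothetical illustration class; not the logic in FSDownload.
final class ClaimWithRetrySketch {

    /** Creates a scratch base directory for the example. */
    static Path newBaseDir() {
        try {
            return Files.createTempDirectory("claim-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Keeps drawing numbers until this caller is the one that creates the directory. */
    static Path claimUniqueDir(Path base, LongSupplier numbers) {
        while (true) {
            Path candidate = base.resolve(Long.toString(numbers.getAsLong()));
            try {
                // Existence check and creation happen as one atomic step.
                return Files.createDirectory(candidate);
            } catch (FileAlreadyExistsException taken) {
                // Another contender owns this name; draw a new number and retry.
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        Path base = newBaseDir();
        long[] draws = {7L, 7L, 8L}; // the second caller repeats the first draw
        int[] index = {0};
        LongSupplier numbers = () -> draws[index[0]++];
        System.out.println(claimUniqueDir(base, numbers).getFileName());
        System.out.println(claimUniqueDir(base, numbers).getFileName());
    }
}
```

Because the claim and the existence check are a single step, there is no 
window between "check whether r1 is present" and "create r1" for a second 
contender to slip through.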





[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

2012-09-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458799#comment-13458799
 ] 

Jason Lowe commented on YARN-112:
-

Here's the localization error that appeared in the nodemanager log when the 
first container failed:

{noformat}
 [Node Status Updater]2012-09-18 14:39:04,476 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
 Resource hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.xml 
transitioned from DOWNLOADING to LOCALIZED
 [IPC Server handler 4 on 8040]2012-09-18 14:39:04,484 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 DEBUG: FAILED { 
hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.jar, 
1347979129443, ARCHIVE }
 [IPC Server handler 3 on 8040]RemoteTrace: 
java.io.IOException: Rename cannot overwrite non empty destination directory 
/xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
at 
org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:706)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:221)
at 
org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:649)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:889)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:162)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
at LocalTrace: 
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
Rename cannot overwrite non empty destination directory 
/xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
at 
org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
at 
org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:823)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:493)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:222)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
at 
org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
at 
org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1528)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1524)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1522)
2012-09-18 14:39:04,494 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1347923101942_0602_01_16 transitioned from LOCALIZING 
to LOCALIZATION_FAILED
{noformat}

 Race in localization can cause containers to fail
 -

 Key: YARN-112
 URL: https://issues.apache.org/jira/browse/YARN-112
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe

 On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
 two map tasks of a MR job, that were launched almost simultaneously on the 
 same node.  It appears they both tried to localize job.jar and job.xml at the 
 same time.  One of the