[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627669#comment-13627669 ]

Hudson commented on YARN-112:

Integrated in Hadoop-Yarn-trunk #179 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/179/])
YARN-112. Fixed a race condition during localization that fails containers. Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by Omkar Vinit Joshi. (Revision 1466196)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalDistributedCacheManager.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/FSDownload.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestFSDownload.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTracker.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java

Race in localization can cause containers to fail

Key: YARN-112
URL: https://issues.apache.org/jira/browse/YARN-112
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Omkar Vinit Joshi
Fix For: 2.0.5-beta
Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, yarn-112-20130326.patch, yarn-112-20130408.1.patch, yarn-112-20130408.patch, yarn-112-20130409.patch, yarn-112.20131503.patch

On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627758#comment-13627758 ]

Hudson commented on YARN-112:

Integrated in Hadoop-Hdfs-trunk #1368 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1368/])
YARN-112. Fixed a race condition during localization that fails containers. Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by Omkar Vinit Joshi. (Revision 1466196)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627813#comment-13627813 ]

Hudson commented on YARN-112:

Integrated in Hadoop-Mapreduce-trunk #1395 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1395/])
YARN-112. Fixed a race condition during localization that fails containers. Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by Omkar Vinit Joshi. (Revision 1466196)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626412#comment-13626412 ]

Hadoop QA commented on YARN-112:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12577749/yarn-112-20130409.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/693//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/693//console

This message is automatically generated.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626999#comment-13626999 ]

Vinod Kumar Vavilapalli commented on YARN-112:

The patch passes eclipse:eclipse on my laptop; I believe the failure is due to an unclean .m2 on the build machines (HADOOP-9251). Checking this in.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627015#comment-13627015 ]

Hudson commented on YARN-112:

Integrated in Hadoop-trunk-Commit #3584 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3584/])
YARN-112. Fixed a race condition during localization that fails containers. Contributed by Omkar Vinit Joshi.
MAPREDUCE-5138. Fix LocalDistributedCacheManager after YARN-112. Contributed by Omkar Vinit Joshi. (Revision 1466196)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1466196
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625829#comment-13625829 ]

Omkar Vinit Joshi commented on YARN-112:

I am rebasing the patch, since YARN-467 is committed and YARN-99 is updated.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625936#comment-13625936 ]

Vinod Kumar Vavilapalli commented on YARN-112:

This patch looks so much better! Two nits:
- FSDownload: Remove the commented-out code completely.
- TestFSDownload: Rename testRaceCondForFSDownload() to something like testUniqueDestinationPath(). A name that only says "race condition" isn't helping.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626117#comment-13626117 ]

Hadoop QA commented on YARN-112:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12577672/yarn-112-20130408.1.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

org.apache.hadoop.mapred.TestMRWithDistributedCache

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/689//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/689//console

This message is automatically generated.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617367#comment-13617367 ]

Robert Joseph Evans commented on YARN-112:

Vinod, I only glanced at the latest patch and did not read it in detail, so if you say it covers that case, I trust you.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616325#comment-13616325 ]

Robert Joseph Evans commented on YARN-112:

I agree that scale exposes races, but the underlying problem is still that we want to create a new unique directory. This seems very simple.

{code}
File uniqueDir = null;
do {
  // mkdir() returns false if the directory could not be created, so try another name
  uniqueDir = new File(baseDir, String.valueOf(rand.nextLong()));
} while (!uniqueDir.mkdir());
{code}

I don't see why we are going through all of this complexity simply because a FileContext API is broken. Playing games to make the race less likely is fine, but ultimately we still have to handle the race.
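The retry loop proposed in the comment above can be fleshed out as a small, self-contained sketch. This is only an illustration, not the code from any attached patch: the class name `UniqueDirs`, the method `createUniqueDir`, and the `maxAttempts` bound are all hypothetical. The bound is added because `File.mkdir()` also returns false when the parent is missing or not writable, in which case an unbounded loop would never terminate.

```java
import java.io.File;
import java.util.Random;

/** Sketch: claim a fresh directory by letting mkdir() arbitrate name collisions. */
final class UniqueDirs {
    private UniqueDirs() {}

    /**
     * Creates a new, uniquely named directory under baseDir and returns it.
     * maxAttempts is a hypothetical safety bound, not part of the original snippet.
     */
    static File createUniqueDir(File baseDir, Random rand, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            File candidate = new File(baseDir, String.valueOf(rand.nextLong()));
            // mkdir() returns true only if this call created the directory,
            // so a name that is already taken simply triggers another attempt.
            if (candidate.mkdir()) {
                return candidate;
            }
        }
        throw new IllegalStateException(
            "could not create a unique directory under " + baseDir
                + " after " + maxAttempts + " attempts");
    }
}
```

Because the directory creation itself decides the winner, two callers that happen to draw the same random name still end up with different directories: the loser just draws again.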
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616327#comment-13616327 ]

Robert Joseph Evans commented on YARN-112:

Oh, and the latest patch's use of a unique number will not always work, because the same code is used from different processes on the same box. We would need a way to guarantee uniqueness across the different processes. currentTimeMillis helps but could still result in a race.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616447#comment-13616447 ]

Vinod Kumar Vavilapalli commented on YARN-112:

bq. Playing games to make the race less likely is fine. But ultimately we still have to handle the race.
bq. Oh and the latest patch using a unique number will not always work, because the same code is used from different processes on the same box.

Bobby, the unique number generation is done in one single process and communicated down. ResourceLocalizationService (NodeManager process) generates the unique path and passes it down to FSDownload (Localizer process), so we can avoid the race altogether.
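The idea in the comment above, one process handing out destination paths so that downloaders never choose their own, can be sketched as follows. This is a minimal illustration under assumptions, not the code in the patch: `UniquePathAllocator` and `allocate()` are hypothetical names, and the counter is assumed to live for the lifetime of the allocating process only.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: a single process allocates destination paths for all downloaders. */
final class UniquePathAllocator {
    private final String baseDir;
    // One shared counter; incrementAndGet() is atomic across threads.
    private final AtomicLong next = new AtomicLong();

    UniquePathAllocator(String baseDir) {
        this.baseDir = baseDir;
    }

    /** Returns a path that no earlier call on this allocator has returned. */
    String allocate() {
        return baseDir + "/" + next.incrementAndGet();
    }
}
```

Uniqueness here holds only among paths handed out by the same allocator instance. A restart resets the counter, so a real implementation would have to clean the base directory or restore the counter on startup; that part is outside this sketch.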
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614402#comment-13614402 ]

Robert Joseph Evans commented on YARN-112:

I am not really sure that we fixed the underlying issue.

{code}
files.rename(dst_work, destDirPath, Rename.OVERWRITE);
{code}

threw an exception because there was something else in that directory already, but files.mkdir(destDirPath, cachePerms, false) is supposed to throw a FileAlreadyExistsException if the directory already exists.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html#mkdir%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.permission.FsPermission,%20boolean%29

files.rename should never get into this situation if files.mkdir threw the exception when it was supposed to. I tested this and

{code}
FileContext lfc = FileContext.getLocalFSFileContext(new Configuration());
Path p = new Path("/tmp/bobby.12345");
FsPermission cachePerms = new FsPermission((short) 0755);
// First call creates the directory.
lfc.mkdir(p, cachePerms, false);
// Second call targets the existing directory and should fail, but does not.
lfc.mkdir(p, cachePerms, false);
{code}

never throws an exception. We first need to address the bug in FileContext, and then we can look at how we can make FSDownload deal with mkdir throwing an exception, or whatever the fix ends up being. I filed HADOOP-9438 for this. If the fix ends up being that we do not support throwing the exception in FileContext, then your current solution looks OK.

I also have a hard time believing that we are getting random collisions on a long value that should be fairly uniformly distributed. We need to guard against it either way and I suppose it is possible, but if I remember correctly we were seeing a significant number of these errors and my gut tells me that there is either something very wrong with Random, or there is something else also going on here.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614650#comment-13614650 ]

Vinod Kumar Vavilapalli commented on YARN-112:

Bobby, I too have seen this in large clusters/jobs - the law of large numbers :) We don't see the random number generator.

HADOOP-9438 will help, but I think that, instead of this solution, avoiding the race altogether by generating a deterministically unique destination path is a better solution. Something like localizer_id + random_num is a better destination path than a plain random number.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614654#comment-13614654 ]

Vinod Kumar Vavilapalli commented on YARN-112:

bq. We don't see the random number generator.

I meant seed*.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614755#comment-13614755 ]

omkar vinit joshi commented on YARN-112:

Vinod's suggestion looks good to me, and it will in fact simplify the FSDownload logic. I am adding a unique number generator (AtomicLong) to LocalResourcesTrackerImpl so that random (in our case now unique) number generation is centralized for public, private, and application cache files.
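A minimal sketch of the centralized counter idea from the comment above, assuming a single process (the NodeManager in this discussion) owns the generator and hands each path down to the localizer. The class name UniquePathGenerator, its constructor arguments, and nextPath are hypothetical and are not taken from the patch; java.nio.file.Path stands in for Hadoop's Path.

```java
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: hands out one distinct cache directory per request. */
final class UniquePathGenerator {
    // Single counter shared by every caller inside this one process.
    private final AtomicLong nextId;
    private final Path cacheRoot;

    UniquePathGenerator(Path cacheRoot, long firstId) {
        this.cacheRoot = cacheRoot;
        this.nextId = new AtomicLong(firstId);
    }

    /** Returns a path that no other call on this instance has returned. */
    Path nextPath() {
        // incrementAndGet is atomic, so concurrent callers never get the same value.
        return cacheRoot.resolve(Long.toString(nextId.incrementAndGet()));
    }
}
```

Uniqueness here holds only within one generator instance, which is why the thread stresses that the number is produced in one process and communicated down rather than generated independently in each localizer.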
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13614929#comment-13614929 ]

Hadoop QA commented on YARN-112:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12575629/yarn-112-20130326.patch
  against trunk revision .

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/614//console

This message is automatically generated.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613342#comment-13613342 ]

Hadoop QA commented on YARN-112:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12575428/yarn-112-20130325.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:red}-1 javac{color}. The applied patch generated 1362 javac compiler warnings (more than the trunk's current 1361 warnings).
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/594//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/594//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/594//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/594//console

This message is automatically generated.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13613391#comment-13613391 ]

Hadoop QA commented on YARN-112:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12575448/yarn-112-20130325.1.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/596//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/596//console

This message is automatically generated.
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608333#comment-13608333 ]

omkar vinit joshi commented on YARN-112:

This problem occurs mainly because the createDir call on FileContext does not throw an exception when the file system is RawLocalFileSystem. If the directory is already present, a new createDir silently returns instead of throwing an exception. This causes the race condition when two containers try to localize at the same time and get the same random number. However, the rename call is atomic, and we should use it to avoid the race condition.

Earlier implementation
1) Generate a random number (r1).
2) Check whether r1 is present. If it is, go to 1); otherwise go to 3).
3) Create directories r1 and r1_tmp.
4) Copy the files into r1_tmp.
5) Rename r1_tmp to r1. (This is an atomic call and only one thread will succeed; the rest will fail. The error listed is just one of the errors that might be logged.)

Suggested fix
1) Generate a random number (r1).
2) Check whether r1 is present. If it is, go to 1); otherwise go to 3).
3) Create dir r1.
4) Rename r1 to r1_tmp. (Only one thread will succeed; the rest will get an exception and go back to 1).)
5) Check whether any file exists inside r1_tmp. If so, rename it back to r1 and go to 1); otherwise go to 6). (This check is needed because two threads can get the same random number and both pass check 2). If one of them then completely finishes its download, it renames r1_tmp back to r1, so the rename from r1 to r1_tmp by the other thread, which only now reaches step 4), will succeed. That must be avoided, and checking the contents of r1_tmp avoids it.)
6) Create r1.
7) Continue with the actual file download.
8) Rename r1_tmp to r1.
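The steps above depend on exactly one caller winning a given directory name. The following is a rough sketch of that property only, not the rename-based sequence proposed in the comment: it swaps in java.nio.file.Files.createDirectory, which fails when the directory already exists, in place of Hadoop's FileContext. The names CacheDirClaimer, claim, ids, and maxAttempts are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.LongSupplier;

/** Hypothetical helper: retries candidate names until this caller owns one. */
final class CacheDirClaimer {
    /**
     * Tries candidate numbers from ids until a directory is created by this
     * caller, and returns that directory. Gives up after maxAttempts tries.
     */
    static Path claim(Path parent, LongSupplier ids, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Path candidate = parent.resolve(Long.toString(ids.getAsLong()));
            try {
                // Existence check and creation are one atomic operation, so at
                // most one concurrent caller succeeds for a given name.
                Files.createDirectory(candidate);
                return candidate;
            } catch (FileAlreadyExistsException collision) {
                // Someone else owns this name; fall through to the next candidate.
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        throw new IllegalStateException(
            "no free directory after " + maxAttempts + " attempts");
    }
}
```

A caller would download into the claimed directory (or a sibling temporary directory) and publish the result afterwards. The point of the sketch is only that a collision on the random number becomes a retry instead of a failed container.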
[jira] [Commented] (YARN-112) Race in localization can cause containers to fail
[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13458799#comment-13458799 ]

Jason Lowe commented on YARN-112:

Here's the localization error that appeared in the nodemanager log when the first container failed:

{noformat}
[Node Status Updater]2012-09-18 14:39:04,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.xml transitioned from DOWNLOADING to LOCALIZED
[IPC Server handler 4 on 8040]2012-09-18 14:39:04,484 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.jar, 1347979129443, ARCHIVE }
[IPC Server handler 3 on 8040]RemoteTrace: java.io.IOException: Rename cannot overwrite non empty destination directory /xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
	at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:706)
	at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:221)
	at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:649)
	at org.apache.hadoop.fs.FileContext.rename(FileContext.java:889)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:162)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
 at LocalTrace:
	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Rename cannot overwrite non empty destination directory /xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
	at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
	at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:823)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:493)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:222)
	at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
	at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
	at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1528)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1524)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1522)
2012-09-18 14:39:04,494 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1347923101942_0602_01_16 transitioned from LOCALIZING to LOCALIZATION_FAILED
{noformat}

Race in localization can cause containers to fail - Key: YARN-112 URL: https://issues.apache.org/jira/browse/YARN-112 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.3 Reporter: Jason Lowe On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the