[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458799#comment-13458799
 ] 

Jason Lowe commented on YARN-112:
---------------------------------

Here's the localization error that appeared in the nodemanager log when the 
first container failed:

{noformat}
 [Node Status Updater]2012-09-18 14:39:04,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.xml transitioned from DOWNLOADING to LOCALIZED
 [IPC Server handler 4 on 8040]2012-09-18 14:39:04,484 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.jar, 1347979129443, ARCHIVE }
 [IPC Server handler 3 on 8040]RemoteTrace: 
java.io.IOException: Rename cannot overwrite non empty destination directory /xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
        at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:706)
        at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:221)
        at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:649)
        at org.apache.hadoop.fs.FileContext.rename(FileContext.java:889)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:162)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
at LocalTrace: 
        org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Rename cannot overwrite non empty destination directory /xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
        at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
        at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:823)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:493)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:222)
        at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
        at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1528)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1524)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1522)
2012-09-18 14:39:04,494 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1347923101942_0602_01_000016 transitioned from LOCALIZING to LOCALIZATION_FAILED
{noformat}
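
For anyone trying to picture the rename failure in the trace above, here is a minimal sketch of the download-then-rename step. It uses plain java.nio.file instead of the Hadoop FileContext API, so it only approximates the behavior. The class and method names (LocalizationRaceSketch, localize) and the "_tmp_" directory naming are hypothetical, and the two localizers run back to back to stand in for the unlucky interleaving. This is not the nodemanager code.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for a localizer's download-then-rename step.
class LocalizationRaceSketch {

    // "Downloads" into a private temp dir, then renames it to the shared
    // final dir. Returns true if this localizer's rename succeeded, false if
    // the destination had already been populated by another localizer.
    static boolean localize(Path cacheRoot, String localizerId, String finalName)
            throws IOException {
        Path tmp = Files.createDirectory(
                cacheRoot.resolve(finalName + "_tmp_" + localizerId));
        Path payload = Files.write(tmp.resolve("job.jar"), new byte[] {1, 2, 3});
        Path dest = cacheRoot.resolve(finalName);
        try {
            // REPLACE_EXISTING can replace an empty directory but not a
            // non-empty one, which is the closest java.nio analogue to the
            // "cannot overwrite non empty destination directory" error.
            Files.move(tmp, dest, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (DirectoryNotEmptyException e) {
            // The losing localizer cleans up only its own temp directory.
            Files.delete(payload);
            Files.delete(tmp);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path cacheRoot = Files.createTempDirectory("filecache");
        String name = "3101732981627262626";
        System.out.println("first localizer renamed:  " + localize(cacheRoot, "a", name));
        System.out.println("second localizer renamed: " + localize(cacheRoot, "b", name));
    }
}
```

With Java 11 or later this can be run directly as "java LocalizationRaceSketch.java". The second call takes the failure branch because the destination directory already holds the first localizer's job.jar.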
                
> Race in localization can cause containers to fail
> -------------------------------------------------
>
>                 Key: YARN-112
>                 URL: https://issues.apache.org/jira/browse/YARN-112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
> two map tasks of a MR job, that were launched almost simultaneously on the 
> same node.  It appears they both tried to localize job.jar and job.xml at the 
> same time.  One of the containers failed when it couldn't rename the 
> temporary job.jar directory to its final name because the target directory 
> wasn't empty.  Shortly afterwards the second container failed because job.xml 
> could not be found, presumably because the first container removed it when it 
> cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
