[ https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458799#comment-13458799 ]
Jason Lowe commented on YARN-112:
---------------------------------
Here's the localization error that appeared in the nodemanager log when the
first container failed:
{noformat}
[Node Status Updater]2012-09-18 14:39:04,476 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.xml transitioned from DOWNLOADING to LOCALIZED
[IPC Server handler 4 on 8040]2012-09-18 14:39:04,484 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://xxx:xxx/user/somebody/.staging/job_1347923101942_0602/job.jar, 1347979129443, ARCHIVE }
[IPC Server handler 3 on 8040]RemoteTrace:
java.io.IOException: Rename cannot overwrite non empty destination directory /xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:706)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:221)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:649)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:889)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:162)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
at LocalTrace:
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Rename cannot overwrite non empty destination directory /xxx/usercache/somebody/appcache/application_1347923101942_0602/filecache/3101732981627262626
at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:823)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:493)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:222)
at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1528)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1524)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1522)
2012-09-18 14:39:04,494 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1347923101942_0602_01_000016 transitioned from LOCALIZING to LOCALIZATION_FAILED
{noformat}
> Race in localization can cause containers to fail
> -------------------------------------------------
>
> Key: YARN-112
> URL: https://issues.apache.org/jira/browse/YARN-112
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 0.23.3
> Reporter: Jason Lowe
>
> On one of our 0.23 clusters, I saw a case of two containers, corresponding to
> two map tasks of a MR job, that were launched almost simultaneously on the
> same node. It appears they both tried to localize job.jar and job.xml at the
> same time. One of the containers failed when it couldn't rename the
> temporary job.jar directory to its final name because the target directory
> wasn't empty. Shortly afterwards the second container failed because job.xml
> could not be found, presumably because the first container removed it when it
> cleaned up.
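
For anyone trying to picture the failure described above, here is a minimal sketch of the losing side of the race. It is not the NodeManager code: it uses plain java.nio.file instead of Hadoop's FileContext.rename, the class and method names (LocalizationRaceSketch, localize) are made up for illustration, and the two localizers run one after the other so the outcome is deterministic. Each one writes into its own temporary directory and then tries to rename it to the same final directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalizationRaceSketch {

    // Hypothetical stand-in for one localizer: "download" job.jar into a
    // private temp directory, then rename that directory to the shared
    // final name. Returns true if the rename succeeded, false if the
    // destination directory already existed and was not empty.
    static boolean localize(Path cacheRoot, String localizerName, Path dest)
            throws IOException {
        Path tmp = Files.createTempDirectory(cacheRoot, localizerName + "_tmp");
        Files.write(tmp.resolve("job.jar"), new byte[] {1, 2, 3});
        try {
            // REPLACE_EXISTING may replace an empty directory, but a
            // non-empty one is rejected with DirectoryNotEmptyException.
            Files.move(tmp, dest, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path cacheRoot = Files.createTempDirectory("filecache");
        // Both localizers target the same final directory name.
        Path dest = cacheRoot.resolve("3101732981627262626");

        boolean first = localize(cacheRoot, "containerA", dest);
        boolean second = localize(cacheRoot, "containerB", dest);

        System.out.println("first rename succeeded: " + first);
        System.out.println("second rename succeeded: " + second);
    }
}
```

The first call should succeed because the destination does not exist yet. The second should be rejected because the destination already holds the first localizer's job.jar, which is the same shape of failure as the "Rename cannot overwrite non empty destination directory" error in the log.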