[
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinod Kumar Vavilapalli updated YARN-2566:
------------------------------------------
Issue Type: Sub-task (was: Bug)
Parent: YARN-91
> IOException happen in startLocalizer of DefaultContainerExecutor due to not
> enough disk space for the first localDir.
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2566
> URL: https://issues.apache.org/jira/browse/YARN-2566
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2566.000.patch, YARN-2566.001.patch,
> YARN-2566.002.patch, YARN-2566.003.patch
>
>
> startLocalizer in DefaultContainerExecutor uses only the first localDir to
> copy the token file. If that copy fails because the first localDir does not
> have enough disk space, localization fails even when the other localDirs have
> plenty of free space. We see the following errors in this case:
> {code}
> 2014-09-13 23:33:25,171 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to
> create app directory
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
> at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
> at
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
> at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
> at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> Localizer failed
> java.io.FileNotFoundException: File
> file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
> at
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
> at
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
> at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
> at
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
> at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
> at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
> at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
> at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
> at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
> at
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1410663092546_0004_01_000001 transitioned from
> LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera
> OPERATION=Container Finished - Failed TARGET=ContainerImpl
> RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
> APPID=application_1410663092546_0004
> CONTAINERID=container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
> Container container_1410663092546_0004_01_000001 transitioned from
> LOCALIZATION_FAILED to DONE
> 2014-09-13 23:33:25,187 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
> Removing container_1410663092546_0004_01_000001 from application
> application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Considering container container_1410663092546_0004_01_000001 for
> log-aggregation
> 2014-09-13 23:33:25,187 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
> event CONTAINER_STOP for appId application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> absolute path :
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete
> returned false for path:
> [/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,188 INFO
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting
> absolute path :
> /hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,188 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete
> returned false for path:
> [/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,291 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Stopping resource-monitoring for container_1410663092546_0004_01_000001
> 2014-09-13 23:33:26,159 INFO
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed
> completed container container_1410663092546_0004_01_000001
> {code}
> The correct behavior is: if an IOException occurs during the copy, try the
> next localDir; if the copy fails for all localDirs, then throw an
> exception.
> I will create a patch to fix this issue.
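
The fallback described above could look roughly like the following. This is only a minimal sketch using plain java.nio, not the code from the attached patches and not the real DefaultContainerExecutor, which goes through Hadoop's FileContext. The class TokenCopyFallback, the method copyToFirstUsableDir, and the appSubDir parameter are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper illustrating the proposed behavior; not Hadoop code.
final class TokenCopyFallback {

  private TokenCopyFallback() {}

  /**
   * Copies tokenFile into the first local dir that accepts it.
   *
   * @param tokenFile  source token file
   * @param localDirs  candidate local dirs, tried in order
   * @param appSubDir  illustrative relative path under each local dir,
   *                   e.g. "usercache/USER/appcache/APP_ID" (assumption)
   * @return the path of the copied file
   * @throws IOException if the copy fails for every local dir
   */
  static Path copyToFirstUsableDir(Path tokenFile, List<Path> localDirs,
      String appSubDir) throws IOException {
    IOException lastFailure = null;
    for (Path localDir : localDirs) {
      try {
        Path appDir = localDir.resolve(appSubDir);
        // Fails with an IOException if the dir is full or unusable.
        Files.createDirectories(appDir);
        Path target = appDir.resolve(tokenFile.getFileName());
        Files.copy(tokenFile, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
      } catch (IOException e) {
        // Remember the failure and fall through to the next local dir.
        lastFailure = e;
      }
    }
    // Every local dir failed (or the list was empty): surface the problem.
    throw new IOException(
        "Unable to copy " + tokenFile + " to any of " + localDirs, lastFailure);
  }
}
```

With localDirs of /hadoop/d1 and /hadoop/d2 as in the log above, a failure on d1 would lead to an attempt on d2 instead of an immediate LOCALIZATION_FAILED. Keeping the last IOException as the cause preserves at least one underlying error for debugging.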