[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169302#comment-14169302
 ] 

Karthik Kambatla commented on YARN-2566:
----------------------------------------

"Submitted" patch to kick off Jenkins. 

> IOException happen in startLocalizer of DefaultContainerExecutor due to not 
> enough disk space for the first localDir.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2566
>                 URL: https://issues.apache.org/jira/browse/YARN-2566
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2566.000.patch, YARN-2566.001.patch, 
> YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, 
> YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, 
> YARN-2566.008.patch
>
>
> startLocalizer in DefaultContainerExecutor uses only the first localDir 
> to copy the token file. If that copy fails because the first localDir is out 
> of disk space, localization fails even when the other localDirs have plenty 
> of free space. We see the following error in this case:
> {code}
> 2014-09-13 23:33:25,171 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
> create app directory 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>       at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>       at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>       at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>       at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>       at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>       at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>       at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.io.FileNotFoundException: File 
> file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 
> does not exist
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>       at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>       at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>       at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
>       at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>       at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>       at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>       at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>       at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>       at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>       at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
>       at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1410663092546_0004_01_000001 transitioned from 
> LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera   
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl    
> RESULT=FAILURE  DESCRIPTION=Container failed with state: LOCALIZATION_FAILED  
>   APPID=application_1410663092546_0004    
> CONTAINERID=container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1410663092546_0004_01_000001 transitioned from 
> LOCALIZATION_FAILED to DONE
> 2014-09-13 23:33:25,187 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1410663092546_0004_01_000001 from application 
> application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1410663092546_0004_01_000001 for 
> log-aggregation
> 2014-09-13 23:33:25,187 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
> event CONTAINER_STOP for appId application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete 
> returned false for path: 
> [/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,188 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,188 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete 
> returned false for path: 
> [/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,291 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1410663092546_0004_01_000001
> 2014-09-13 23:33:26,159 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed container container_1410663092546_0004_01_000001
> {code}
> The correct behavior is: if an IOException occurs during the copy, try the 
> next localDir. Only if the copy fails for all localDirs, throw an 
> exception. 
> I will create a patch to fix this issue.
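
The fallback described above can be sketched roughly as follows. This is not the attached patch and does not use Hadoop's FileContext API. It is a minimal illustration built on java.nio.file, and the class name TokenCopyFallback, the method copyToFirstUsableDir, and the relative path used in main are all hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: illustrates "try the next localDir".
class TokenCopyFallback {

    /**
     * Copies tokenFile to relativeDest under the first directory in localDirs
     * that accepts it. Throws only after every directory has failed.
     */
    static Path copyToFirstUsableDir(Path tokenFile, List<Path> localDirs,
                                     String relativeDest) throws IOException {
        IOException failure = new IOException(
            "Could not copy " + tokenFile + " to any of " + localDirs);
        for (Path dir : localDirs) {
            Path dest = dir.resolve(relativeDest);
            try {
                // Either step can fail, e.g. when the disk is full.
                Files.createDirectories(dest.getParent());
                return Files.copy(tokenFile, dest,
                                  StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                // Remember why this directory failed, then try the next one.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("localdirs");
        Path token = Files.write(root.resolve("src.tokens"), "demo".getBytes());
        // A regular file stands in for an unusable first localDir.
        Path unusable = Files.createFile(root.resolve("d1"));
        Path usable = Files.createDirectory(root.resolve("d2"));
        Path dest = copyToFirstUsableDir(
            token, Arrays.asList(unusable, usable), "app_0001/example.tokens");
        System.out.println("Token copied to " + dest);
    }
}
```

Collecting the per-directory failures as suppressed exceptions keeps the cause for each localDir visible in the final stack trace if all of them fail.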



