[ 
https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2566:
----------------------------
    Description: 
startLocalizer in DefaultContainerExecutor uses only the first localDir to 
copy the token file. If that copy fails because the first localDir does not 
have enough disk space, localization fails even though the other localDirs 
have plenty of disk space. We see the following error in this case:
{code}
2014-09-13 23:33:25,171 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to 
create app directory 
/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
java.io.IOException: mkdir of 
/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
        at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
        at 
org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
        at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
        at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
        at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
2014-09-13 23:33:25,185 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Localizer failed
java.io.FileNotFoundException: File 
file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does 
not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
        at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
        at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
        at 
org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
        at 
org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
        at 
org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
        at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
        at 
org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
        at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
        at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
2014-09-13 23:33:25,186 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1410663092546_0004_01_000001 transitioned from LOCALIZING 
to LOCALIZATION_FAILED
2014-09-13 23:33:25,187 WARN 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera     
OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  
DESCRIPTION=Container failed with state: LOCALIZATION_FAILED    
APPID=application_1410663092546_0004    
CONTAINERID=container_1410663092546_0004_01_000001
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1410663092546_0004_01_000001 transitioned from 
LOCALIZATION_FAILED to DONE
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Removing container_1410663092546_0004_01_000001 from application 
application_1410663092546_0004
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
 Considering container container_1410663092546_0004_01_000001 for 
log-aggregation
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
event CONTAINER_STOP for appId application_1410663092546_0004
2014-09-13 23:33:25,187 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
absolute path : 
/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
2014-09-13 23:33:25,187 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete 
returned false for path: 
[/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
2014-09-13 23:33:25,188 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
absolute path : 
/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
2014-09-13 23:33:25,188 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete 
returned false for path: 
[/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
2014-09-13 23:33:25,291 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Stopping resource-monitoring for container_1410663092546_0004_01_000001
2014-09-13 23:33:26,159 INFO 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
completed container container_1410663092546_0004_01_000001
{code}

The correct behavior is: if an IOException occurs during the copy, try the 
next localDir. If the copy fails for all the localDirs, throw an exception. 
I will create a patch to fix this issue.
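
The fallback described above can be sketched roughly as follows. This is only an illustration, not the actual patch: the class name TokenCopyFallback, the method copyToFirstUsableDir, and its parameters are hypothetical, and it uses java.nio.file instead of the Hadoop FileContext calls that appear in the stack trace.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, for illustration only (not part of DefaultContainerExecutor).
public class TokenCopyFallback {

    /**
     * Copies src to relativeDest under the first directory in localDirs
     * that accepts it. Returns the destination path that was written.
     * Throws an IOException only if the copy fails for every directory.
     */
    public static Path copyToFirstUsableDir(Path src, List<Path> localDirs,
                                            String relativeDest) throws IOException {
        IOException lastError = null;
        for (Path localDir : localDirs) {
            Path dest = localDir.resolve(relativeDest);
            try {
                // Create the parent directories, then copy the file.
                Files.createDirectories(dest.getParent());
                Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
                return dest;
            } catch (IOException e) {
                // For example, not enough disk space: remember the error
                // and fall through to the next localDir.
                lastError = e;
            }
        }
        // Every localDir failed (or the list was empty).
        throw new IOException(
            "Failed to copy " + src + " to any of " + localDirs, lastError);
    }
}
```

With localDirs such as /hadoop/d1 and /hadoop/d2 from the log above, a failure under d1 would no longer fail the localization as long as the copy succeeds under d2.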


> IOException happen in startLocalizer of DefaultContainerExecutor due to not 
> enough disk space for the first localDir.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2566
>                 URL: https://issues.apache.org/jira/browse/YARN-2566
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu



