[
https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594210#comment-13594210
]
omkar vinit joshi commented on YARN-99:
---------------------------------------
The problem of a large number of files (count exceeding the maximum file limit
per directory) may occur at the locations marked * below, and it needs to be
fixed in both places. The directory structure shown is per local directory.
[For-any-local-dir]
---- [filecache *]
---- [usercache]
     ---- [userid]
          ---- [filecache *]
          ---- [appcache]
               ---- [appid]
                    ---- [filecache]
In the application-specific filecache it is highly improbable that we hit that
limit, whereas in the other two places it is quite likely.
Proposed solution: add an internal parameter (not externally configurable) to
LocalResourcesTrackerImpl to control the hierarchical behavior for different
types of resources.
For hierarchical resources, the hierarchy is managed as follows
(<orig-dir> = the path ending with "filecache"):
<orig-dir>
---- (8192 / 8k localized files)
---- 26 directories (a-z) (created only when the directory limit is reached)
     ---- [a]
          ---- (8192 / 8k localized files)
          ---- 26 directories (a-z)
               .
               .
               .
Reasons for creating directories with a single character:
1) On Windows there is also a MAX_PATH limit (~260 characters), so path
components must stay short.
2) With this hierarchy we can accommodate >1M files within 3 levels, which is
practically sufficient.
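The allocation scheme above can be sketched as follows. This is a hypothetical illustration, not the actual YARN implementation: the class name, the 8192 per-directory cap, and the breadth-first fill order (fill <orig-dir>, then a..z, then a/a..z/z, and so on) are all assumptions taken from this comment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the hierarchical cache-directory allocator
// described above. Tracks only relative paths and counts; the caller
// would mkdir the returned directory under <orig-dir> as needed.
class CacheDirAllocator {
    // Assumed per-directory cap from the comment above.
    private static final int PER_DIR_LIMIT = 8192;

    private static final class Dir {
        final String path;  // relative to <orig-dir>; "" means <orig-dir> itself
        int fileCount = 0;  // files localized directly in this directory
        Dir(String path) { this.path = path; }
    }

    // Directories still accepting files, in breadth-first creation order.
    private final Deque<Dir> open = new ArrayDeque<>();

    CacheDirAllocator() { open.addLast(new Dir("")); }

    /** Returns the relative directory the next localized file should go in. */
    String nextPath() {
        Dir d = open.peekFirst();
        while (d.fileCount >= PER_DIR_LIMIT) {
            // Current directory is full: retire it and enqueue its 26
            // single-character children (created only once the limit is hit).
            open.pollFirst();
            for (char c = 'a'; c <= 'z'; c++) {
                String child = d.path.isEmpty()
                        ? String.valueOf(c)
                        : d.path + "/" + c;
                open.addLast(new Dir(child));
            }
            d = open.peekFirst();
        }
        d.fileCount++;
        return d.path;
    }
}
```

With this ordering the first 8192 files land directly in <orig-dir>, the next ones in "a", then "b", and so on; three levels give 703 directories, i.e. well over 1M files at 8192 per directory, matching the estimate above.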
Let me know if I need to look at any specific scenario or corner case, or if
the design needs modification.
> Jobs fail during resource localization when directories in file cache reaches
> to unix directory limit
> -----------------------------------------------------------------------------------------------------
>
> Key: YARN-99
> URL: https://issues.apache.org/jira/browse/YARN-99
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0, 2.0.0-alpha
> Reporter: Devaraj K
> Assignee: Devaraj K
>
> If we have multiple jobs that use the distributed cache with small
> files, the directory limit is reached before the cache-size limit, and the
> node manager fails to create any more directories in the file cache. The
> jobs then start failing with the exception below.
> {code:java}
> java.io.IOException: mkdir of
> /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
> at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
> at
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
> at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
> at
> org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
> at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> We should have a mechanism to clean the cache files once the number of
> directories crosses a specified limit, analogous to the cache-size limit.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira