[
https://issues.apache.org/jira/browse/YARN-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615894#comment-13615894
]
Omkar Vinit Joshi commented on YARN-467:
----------------------------------------
The underlying problem here is that ResourceLocalization is trying to localize
more files into a single directory than the underlying local file system allows
per directory.
Proposed solution (for PUBLIC resources, which are localized under
<local-dirs>/filecache/):
We are going to maintain a hierarchical directory structure inside the local
directories for filecache, so the directory structure will look like this:
.../filecache/<default-~8192-files>
.../filecache/<36 directories (0-9 & a-z)>/<default-~8192-files>
.../filecache/<36 directories (0-9 & a-z)>/<36 directories (0-9 & a-z)>
.....................
In all, every directory will hold at most (8192-36) localized files plus up to
36 sub directories named 0-9 and a-z. These sub directories are created only
when they are required; they are never created in advance. Every sub directory
follows the same structure.
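As a rough illustration of the layout above, the sketch below numbers the sub
directories breadth-first and converts a number into its relative path. The
class name, the method name and the numbering order are assumptions made for
this example only; the patch may number directories differently.

```java
// Hypothetical sketch: names and numbering order are illustrative assumptions.
final class SubDirPathSketch {
    static final int DIRECTORIES_PER_LEVEL = 36; // sub directories 0-9 and a-z

    /**
     * Maps a breadth-first directory number to a relative path.
     * 0 is the filecache root (""), 1..36 are "0".."z", 37 is "0/0", and so on.
     */
    static String relativePath(int directoryNo) {
        StringBuilder sb = new StringBuilder();
        int n = directoryNo;
        while (n > 0) {
            n--; // shift to 0-based so that every level has exactly 36 names
            char name = Character.forDigit(n % DIRECTORIES_PER_LEVEL, DIRECTORIES_PER_LEVEL);
            if (sb.length() > 0) {
                sb.insert(0, '/');
            }
            sb.insert(0, name);
            n /= DIRECTORIES_PER_LEVEL;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int i : new int[] {0, 1, 36, 37, 73}) {
            System.out.println(i + " -> \"" + relativePath(i) + "\"");
        }
    }
}
```

With this numbering all 36 first-level directories are handed out before any
second-level directory, which keeps paths as short as possible.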
To manage files and to limit the number of files per directory to
HierarchicalDirectory#PER_DIR_FILE_LIMIT (in this case 8192), I am introducing
the classes / changes below.
* LocalResourcesTrackerImpl :-
** maintainHierarchicalDir :- a boolean flag. It should be set when you want
to use this resource tracker to track resources with hierarchical directory
structure.
** directoryMap :- Map of <Path, HierarchicalDirectory>. It makes sure that we
have one HierarchicalDirectory for every localPath. ( For example if we have
two local-dirs configured then it will have 2 entries.)
** inProgressRsrcMap :- Map of <LocalResourceRequest, Path>. This is used while
a local resource is being localized. This map helps in two ways:
*** If localization fails for that resource, we can retrieve the path and
remove the file reservation (file count).
*** If a LocalResourceRequest arrives again for the same resource (which is
highly unlikely with today's implementation), it can return the same path
back.
** getPathForLocalResource :- This method should be called to retrieve the
Hierarchical directory path for the local-dir identified by the localDirPath.
Internally it adds this request and returned path to inProgressRsrcMap and
makes a reservation into the HierarchicalDirectory tracking this local-dir-path.
** decFileCountForHierarchicalPath :- It retrieves the localizedPath from
either inProgressRsrcMap or from LocalizedResource and then reduces file count
for the HierarchicalDirectory tracking it.
** localizationCompleted :- (Parameter - success) If true then it will only
update inProgressRsrcMap; otherwise it will update inProgressRsrcMap and will
also call decFileCountForHierarchicalPath.
* HierarchicalDirectory :- It just helps in managing hierarchical directories.
** PER_DIR_FILE_LIMIT :- It controls the number of entries allowed per
directory and per sub directory. It is configurable
(YarnConfiguration.NM_LOCAL_CACHE_NUM_FILES_PER_DIRECTORY) but should not be
set too low, because 36 entries of every directory are reserved for sub
directories.
** DIRECTORIES_PER_LEVEL (constant 36) :- Every directory/sub-directory will
have at most 36 sub directories (0-9 and a-z), created only if they are
required. Single-character names are used because of the path length limit on
Windows.
** vacantSubDirectories :- Queue<HierarchicalSubDirectory>. At the beginning
this holds the root of the HierarchicalDirectory as the only sub directory. If
the queue becomes empty, a new sub directory is created, starting with 0.
Note:- This only creates the internal tracking entry; it does not create an
actual directory on the file system.
** knownSubDirectories :- Map of <String, HierarchicalSubDirectory>. The root
directory is identified by an empty string "" and the other sub directories by
their relative paths; for example, directory 0 is keyed by "0" and 0/a by "0/a".
** getHierarchicalPath :- (synchronized) This method returns the relative path
of a sub directory that is vacant (has not reached its directory file limit).
If no vacant sub directory is present, it creates one using
totalSubDirectories.
** decFileCountForPath :- (synchronized) This method reduces the count for the
HierarchicalSubDirectory representing the passed in relative path.
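The bookkeeping described in the two lists above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the code in
the attached patches: requests and local dirs are plain strings, the class
names carry a "Sketch" suffix, the breadth-first naming helper is assumed, and
the handling of a full directory becoming vacant again is a guess.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Illustrative only: a simplified stand-in for HierarchicalDirectory. */
final class HierarchicalDirectorySketch {
    static final int DIRECTORIES_PER_LEVEL = 36; // sub directories 0-9 and a-z

    private static final class SubDir {
        final String relativePath;
        int fileCount;
        SubDir(String relativePath) { this.relativePath = relativePath; }
    }

    private final int filesPerDir; // limit minus the 36 slots kept for sub directories
    private int totalSubDirectories = 0;
    private final Queue<SubDir> vacantSubDirectories = new ArrayDeque<>();
    private final Map<String, SubDir> knownSubDirectories = new HashMap<>();

    HierarchicalDirectorySketch(int perDirFileLimit) {
        if (perDirFileLimit <= DIRECTORIES_PER_LEVEL) {
            throw new IllegalArgumentException("limit must exceed " + DIRECTORIES_PER_LEVEL);
        }
        filesPerDir = perDirFileLimit - DIRECTORIES_PER_LEVEL;
        track(""); // the root is the only known sub directory at the beginning
    }

    private SubDir track(String relativePath) {
        SubDir dir = new SubDir(relativePath);
        knownSubDirectories.put(relativePath, dir);
        vacantSubDirectories.add(dir);
        return dir;
    }

    /** Reserves one file slot and returns the relative path of its sub directory. */
    synchronized String getHierarchicalPath() {
        SubDir dir = vacantSubDirectories.peek();
        if (dir == null) {
            // Internal tracking only; nothing is created on the file system here.
            dir = track(relativePath(++totalSubDirectories));
        }
        if (++dir.fileCount >= filesPerDir) {
            vacantSubDirectories.remove(dir); // directory is full now
        }
        return dir.relativePath;
    }

    /** Releases one file slot; a previously full directory becomes vacant again. */
    synchronized void decFileCountForPath(String relativePath) {
        SubDir dir = knownSubDirectories.get(relativePath);
        if (dir == null || dir.fileCount == 0) {
            return;
        }
        if (dir.fileCount-- >= filesPerDir) { // compares the count before the decrement
            vacantSubDirectories.add(dir);
        }
    }

    /** Assumed breadth-first naming: 0 -> "", 1 -> "0", 36 -> "z", 37 -> "0/0". */
    static String relativePath(int directoryNo) {
        StringBuilder sb = new StringBuilder();
        for (int n = directoryNo; n > 0; n /= DIRECTORIES_PER_LEVEL) {
            n--;
            if (sb.length() > 0) {
                sb.insert(0, '/');
            }
            sb.insert(0, Character.forDigit(n % DIRECTORIES_PER_LEVEL, DIRECTORIES_PER_LEVEL));
        }
        return sb.toString();
    }
}

/** Illustrative only: a simplified stand-in for the tracker-side bookkeeping. */
final class TrackerSketch {
    private final int perDirFileLimit;
    private final Map<String, HierarchicalDirectorySketch> directoryMap = new HashMap<>();
    // request -> {localDirPath, relativePath} while localization is in progress
    private final Map<String, String[]> inProgressRsrcMap = new HashMap<>();

    TrackerSketch(int perDirFileLimit) { this.perDirFileLimit = perDirFileLimit; }

    synchronized String getPathForLocalResource(String request, String localDirPath) {
        String[] r = inProgressRsrcMap.get(request);
        if (r == null) { // a repeated request gets the same path back
            HierarchicalDirectorySketch dir = directoryMap.computeIfAbsent(
                localDirPath, k -> new HierarchicalDirectorySketch(perDirFileLimit));
            r = new String[] {localDirPath, dir.getHierarchicalPath()};
            inProgressRsrcMap.put(request, r);
        }
        return r[1].isEmpty() ? r[0] : r[0] + "/" + r[1];
    }

    synchronized void localizationCompleted(String request, boolean success) {
        String[] r = inProgressRsrcMap.remove(request);
        if (!success && r != null) {
            directoryMap.get(r[0]).decFileCountForPath(r[1]); // drop the reservation
        }
    }
}
```

For example, with a limit of 38 (two files per directory) the first two
requests for a local dir land in its root, the next two in sub directory 0,
and a failed localization frees its slot for a later request.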
> Jobs fail during resource localization when public distributed-cache hits
> unix directory limits
> -----------------------------------------------------------------------------------------------
>
> Key: YARN-467
> URL: https://issues.apache.org/jira/browse/YARN-467
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.0.0, 2.0.0-alpha
> Reporter: Omkar Vinit Joshi
> Assignee: Omkar Vinit Joshi
> Attachments: yarn-467-20130322.1.patch, yarn-467-20130322.2.patch,
> yarn-467-20130322.3.patch, yarn-467-20130322.patch,
> yarn-467-20130325.1.patch, yarn-467-20130325.path
>
>
> If we have multiple jobs which use the distributed cache with small files,
> the directory limit is reached before the cache size limit, and the
> NodeManager fails to create any more directories in the file cache (PUBLIC).
> The jobs start failing with the exception below.
> java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 failed
> at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
> at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
> at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
> at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
> at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
> at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> We need a mechanism to create a directory hierarchy and limit the number of
> files per directory.