[jira] [Commented] (YARN-467) Jobs fail during resource localization when public distributed-cache hits unix directory limits

Siddharth Seth (JIRA) Mon, 01 Apr 2013 11:15:17 -0700

    [ https://issues.apache.org/jira/browse/YARN-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618997#comment-13618997 ]


Siddharth Seth commented on YARN-467:
-------------------------------------

bq. Another thing I've been looking hard is to see if 
LocalResourceTracker.localizationCompleted() can be done away with completely 
in favour of the handle() method. But to do that we need to handle both 
successful and failing localizations via handle(). I can already see a couple 
of bugs related to localization failures, so let's do this separately.
That could be the route to reach the LocalizedResources, instead of sending 
events to them directly. In any case, this can be figured out in the 
follow-up JIRAs.
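
To make the idea above concrete, here is a rough sketch of a tracker that 
receives both successful and failed localizations through a single handle() 
method. Every name in it (TrackerSketch, ResourceEvent, EventType, State) is 
hypothetical and chosen only for illustration; it is not the NodeManager's 
actual class layout or the approach the patch takes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one handle() entry point for request, success and failure.
final class TrackerSketch {
    enum EventType { REQUEST, LOCALIZED, LOCALIZATION_FAILED }
    enum State { DOWNLOADING, LOCALIZED }

    // Minimal event carrying only the resource key; a real event would carry more.
    static final class ResourceEvent {
        final EventType type;
        final String resourceKey;

        ResourceEvent(EventType type, String resourceKey) {
            this.type = type;
            this.resourceKey = resourceKey;
        }
    }

    private final Map<String, State> resources = new HashMap<>();

    // All outcomes arrive here, so no separate completion callback is needed.
    void handle(ResourceEvent event) {
        switch (event.type) {
            case REQUEST:
                // Start tracking only if the resource is not already known.
                resources.putIfAbsent(event.resourceKey, State.DOWNLOADING);
                break;
            case LOCALIZED:
                // Ignore completions for resources that were never requested.
                if (resources.containsKey(event.resourceKey)) {
                    resources.put(event.resourceKey, State.LOCALIZED);
                }
                break;
            case LOCALIZATION_FAILED:
                // Drop the entry so a later request can retry the download.
                resources.remove(event.resourceKey);
                break;
        }
    }

    // Returns null when the resource is not tracked.
    State stateOf(String resourceKey) {
        return resources.get(resourceKey);
    }
}
```

In this sketch a failure simply removes the entry; whether a failed resource 
should be removed, retried, or kept in a failed state is one of the questions 
left for the follow-up.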

I had looked at this patch earlier as well; it mostly looks good in terms of 
functionality. It was a little tough to read; hopefully some of the changes 
suggested by Vinod will make that easier. 
                
> Jobs fail during resource localization when public distributed-cache hits 
> unix directory limits
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-467
>                 URL: https://issues.apache.org/jira/browse/YARN-467
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Omkar Vinit Joshi
>            Assignee: Omkar Vinit Joshi
>         Attachments: yarn-467-20130322.1.patch, yarn-467-20130322.2.patch, 
> yarn-467-20130322.3.patch, yarn-467-20130322.patch, 
> yarn-467-20130325.1.patch, yarn-467-20130325.path, yarn-467-20130328.patch
>
>
> If multiple jobs use the distributed cache with small files, the directory 
> limit is reached before the cache size limit, and no further directories 
> can be created in the file cache (PUBLIC). The jobs start failing with the 
> exception below.
> java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 failed
>       at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
>       at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
>       at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>       at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
>       at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
>       at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
>       at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
>       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
>       at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> We need a mechanism that creates a directory hierarchy and limits the 
> number of files per directory.
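
As an illustration of the mechanism requested in the description, the sketch 
below maps a sequential resource id to a fixed-depth relative path so that no 
directory ever holds more than a chosen number of children. The class name 
HierarchicalCachePath and the parameters perDirLimit and levels are 
assumptions made for this example; the sketch does not show what the attached 
patches implement.

```java
// Hypothetical sketch: spread cache entries over a fixed-depth directory tree.
final class HierarchicalCachePath {
    private HierarchicalCachePath() {}

    /**
     * Writes id as a number with `levels` digits in base `perDirLimit` and
     * uses each digit as one path component. Every directory therefore has
     * at most perDirLimit children, and entries live only at the leaf level.
     */
    public static String relativePathFor(long id, int perDirLimit, int levels) {
        if (id < 0 || perDirLimit < 2 || levels < 1) {
            throw new IllegalArgumentException("invalid id, perDirLimit or levels");
        }
        String[] parts = new String[levels];
        long remaining = id;
        // Fill components from the deepest level up to the top level.
        for (int i = levels - 1; i >= 0; i--) {
            parts[i] = Long.toString(remaining % perDirLimit);
            remaining /= perDirLimit;
        }
        if (remaining != 0) {
            // The tree can hold only perDirLimit^levels entries.
            throw new IllegalArgumentException("id exceeds tree capacity");
        }
        return String.join("/", parts);
    }
}
```

For example, with perDirLimit = 100 and levels = 3, id 12345 maps to 1/23/45, 
which would then be resolved against the file cache root (such as the 
/tmp/nm-local-dir/filecache directory seen in the stack trace). A fixed depth 
keeps the sketch simple at the cost of a hard capacity; a scheme that grows 
the tree on demand would avoid that limit.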

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
