[ 
https://issues.apache.org/jira/browse/YARN-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352688#comment-17352688
 ] 

Jim Brennan commented on YARN-7713:
-----------------------------------

I'm not convinced that this is a good idea at all, for several reasons:
 # On our clusters, we rarely download a directory for 
localization. Based on scanning nodemanager logs on our clusters, the vast 
majority of localized resources are individual files or archives. I think in 
general (at least here at Yahoo) it is not recommended to localize directories, 
because there are issues with tracking them. In particular, if they have 
subdirectories, changes in the subdirectories will not be noticed. Since 
localizing directories is so rare, I don't think this optimization is worth the 
added complexity (at least in our use cases).
 # I agree with others that splitting the work by file count alone is probably 
not ideal. File sizes can vary wildly.
 # More threads for localization are not necessarily a good thing. We currently 
have a configurable number of threads for public localizers (defaults to 4), 
plus 1 per container for private localizers. Increasing the number of threads 
running at once increases pressure on the NameNode, and for rotational disks, 
it may actually slow things down locally as well by increasing IOPS. SSD/NVMe 
disks could probably handle more simultaneous localizers.
 # I don't like that FSDownload is just firing up some number of threads for 
directories. I would prefer that the threading be done at a higher level 
(in the callers of FSDownload).
 # I think a better approach for allowing more threads for localization would 
be to support parallel downloads in the private localizers, as suggested in 
YARN-574. Any solution needs to be configurable.
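
To make points 2 and 4 concrete, here is a minimal sketch. It is not taken 
from the patch or from Hadoop: the class SizeBalancedSplit and its methods are 
hypothetical names, and the "copy" is a stand-in that only sums bytes. It 
assigns files to buckets by size rather than by count, then runs one task per 
bucket on a thread pool that the caller creates and sizes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not part of FSDownload or any Hadoop API.
class SizeBalancedSplit {

  // Returns, for each file index, the bucket that file is assigned to.
  // Greedy longest-first: visit files from largest to smallest and give each
  // one to the bucket holding the fewest bytes so far.
  static int[] assign(long[] sizes, int buckets) {
    if (buckets < 1) {
      throw new IllegalArgumentException("buckets must be >= 1");
    }
    Integer[] order = new Integer[sizes.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort file indices by size, largest first (stable for equal sizes).
    Arrays.sort(order, (x, y) -> Long.compare(sizes[y], sizes[x]));

    long[] load = new long[buckets];
    int[] assignment = new int[sizes.length];
    for (int idx : order) {
      int lightest = 0;
      for (int b = 1; b < buckets; b++) {
        if (load[b] < load[lightest]) {
          lightest = b;
        }
      }
      assignment[idx] = lightest;
      load[lightest] += sizes[idx];
    }
    return assignment;
  }

  // Runs one task per bucket on a pool owned by the caller, so the caller
  // (not the download code) decides how many threads exist.
  // Returns the number of bytes "copied" per bucket.
  static long[] runOnCallerPool(ExecutorService pool, long[] sizes,
                                int[] assignment, int buckets) {
    List<Future<Long>> futures = new ArrayList<>();
    for (int b = 0; b < buckets; b++) {
      final int bucket = b;
      futures.add(pool.submit(() -> {
        long bytes = 0;
        for (int i = 0; i < sizes.length; i++) {
          if (assignment[i] == bucket) {
            bytes += sizes[i]; // stand-in for the real file copy
          }
        }
        return bytes;
      }));
    }
    long[] copied = new long[buckets];
    try {
      for (int b = 0; b < buckets; b++) {
        copied[b] = futures.get(b).get();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    }
    return copied;
  }
}
```

With one 1000-byte file and five 10-byte files split across two buckets, the 
large file gets a bucket to itself and the five small files share the other, 
whereas an even split by count would put the large file together with two of 
the small ones. The pool size would come from a configuration value chosen by 
the caller, for example Executors.newFixedThreadPool(n).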

> Add parallel copying of directories into FSDownload
> ---------------------------------------------------
>
>                 Key: YARN-7713
>                 URL: https://issues.apache.org/jira/browse/YARN-7713
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Miklos Szegedi
>            Assignee: Christos Karampeazis-Papadakis
>            Priority: Major
>              Labels: newbie, pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> YARN currently copies directories sequentially when localizing. This could be 
> improved by copying in parallel, since the source blocks are normally on 
> different nodes.



