Nathan Roberts commented on YARN-2410:

One minor comment. If it's not too much trouble, could you add some comments 
about why we need to limit the number of FDs, how the code is accomplishing it, 
and why atomic operations need to be used in send_map(). It will help future 
maintainers of the code understand some of the rationale behind the 
complexities in this area. 
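To illustrate the point about atomics for future readers: the FD accounting is shared across multiple Netty I/O threads, so a plain "openFiles++" would be a read-modify-write race and could let concurrent sends slip past the limit. A minimal sketch of that idea, with compare-and-set used to reserve an FD slot (class and method names here are hypothetical, not from the patch):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of race-free FD accounting shared by I/O threads.
public class FdAccounting {
    // Shared across threads; a plain int counter would race on increment.
    private final AtomicInteger openFiles = new AtomicInteger();
    private final int maxOpenFiles;

    public FdAccounting(int maxOpenFiles) {
        this.maxOpenFiles = maxOpenFiles;
    }

    // Atomically reserve a slot for one open map-output file.
    // Returns false when at the cap, so the caller defers the open.
    public boolean tryReserve() {
        while (true) {
            int cur = openFiles.get();
            if (cur >= maxOpenFiles) {
                return false;                       // at the cap; defer this send
            }
            if (openFiles.compareAndSet(cur, cur + 1)) {
                return true;                        // slot reserved race-free
            }
            // CAS lost to another thread; re-read and retry.
        }
    }

    // Called after the file is closed, freeing a slot.
    public void release() {
        openFiles.decrementAndGet();
    }

    public int open() {
        return openFiles.get();
    }
}
```

The check-then-increment must be a single atomic step; with two separate operations, two threads could both pass the check and both open a file, exceeding the cap.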

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> ----------------------------------------------------------------
>                 Key: YARN-2410
>                 URL: https://issues.apache.org/jira/browse/YARN-2410
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Nathan Roberts
>            Assignee: Kuhu Shukla
>         Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch, 
> YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch, 
> YARN-2410-v5.patch, YARN-2410-v6.patch, YARN-2410-v7.patch, 
> YARN-2410-v8.patch, YARN-2410-v9.patch
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed. 
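The refactor described above amounts to bounding how many map-output files are open at once and opening each one lazily, only when a transfer slot is free, rather than opening all 240000 up front. A rough sketch of that shape using a semaphore as the FD cap (class name, method name, and the cap constant are illustrative, not the actual patch):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap concurrently open map-output files.
public class BoundedShuffleSender {
    private static final int MAX_OPEN_FDS = 2;          // illustrative cap
    private final Semaphore openPermits = new Semaphore(MAX_OPEN_FDS);
    private final AtomicInteger currentlyOpen = new AtomicInteger();
    private final AtomicInteger peakOpen = new AtomicInteger();

    // Open the file only once a permit is free (lazy open),
    // instead of opening every requested output immediately.
    public void sendMapOutput(String mapId) {
        openPermits.acquireUninterruptibly();           // wait for an FD slot
        try {
            int now = currentlyOpen.incrementAndGet();  // simulates opening the file
            peakOpen.accumulateAndGet(now, Math::max);
            // ... async transfer of the requested file region would go here ...
        } finally {
            currentlyOpen.decrementAndGet();            // simulates closing the file
            openPermits.release();
        }
    }

    public int peak() {
        return peakOpen.get();
    }

    public static void main(String[] args) throws Exception {
        BoundedShuffleSender s = new BoundedShuffleSender();
        Thread[] reducers = new Thread[40];             // 40 concurrent requests
        for (int i = 0; i < reducers.length; i++) {
            final int id = i;
            reducers[i] = new Thread(() -> s.sendMapOutput("map_" + id));
            reducers[i].start();
        }
        for (Thread t : reducers) {
            t.join();
        }
        // Peak never exceeds MAX_OPEN_FDS regardless of request count.
        System.out.println("peak open = " + s.peak());
    }
}
```

With this shape, 6000 reducers asking for 40 outputs each still never hold more than the cap's worth of file descriptors open at one time.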

This message was sent by Atlassian JIRA
