Jason Lowe commented on YARN-2410:

Thanks for updating the patch.  I see now that since Shuffle is not a static 
class it's too much trouble to try to factor out the overrides for the test.

ReduceContext should be a private class.

sendMap takes a reduce context, a channel context, and an info map, but the 
latter two are already in the reduce context.  Seems like sendMap should just 
take a reduce context argument.
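
A minimal sketch of the narrower signature, assuming the reduce context already exposes the channel context and the info map through getters. The class names, getters, and the String/Object stand-ins for the Netty and map output types are hypothetical and not taken from the patch.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the patch's ReduceContext. Plain String/Object
// types replace the real Netty channel context and map output info classes.
final class ReduceContextSketch {
    private final Object channelCtx;
    private final Map<String, String> infoMap;
    private final List<String> mapIds;
    private int nextIndex = 0;

    ReduceContextSketch(Object channelCtx, Map<String, String> infoMap,
                        List<String> mapIds) {
        this.channelCtx = channelCtx;
        this.infoMap = infoMap;
        this.mapIds = mapIds;
    }

    Object getChannelCtx() { return channelCtx; }

    Map<String, String> getInfoMap() { return infoMap; }

    // Returns the next map id to send, or null when none remain.
    String nextMapId() {
        return nextIndex < mapIds.size() ? mapIds.get(nextIndex++) : null;
    }
}

final class ShuffleSketch {
    // Single-argument form: everything sendMap needs is reachable from the
    // reduce context, so the channel context and info map are not passed again.
    static String sendMap(ReduceContextSketch rc) {
        String mapId = rc.nextMapId();
        if (mapId == null) {
            return null;
        }
        String info = rc.getInfoMap().get(mapId);
        return "sent " + mapId + " (" + info + ") on " + rc.getChannelCtx();
    }
}
```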

If sendMap returns null then I don't think we want messageReceived to blindly 
keep calling sendMap in the loop.
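
One way to guard the loop, sketched under assumptions: sendMap is modeled as a supplier that returns null on failure, and maxSends is a hypothetical stand-in for however many map outputs messageReceived starts up front. Neither name comes from the patch.

```java
import java.util.function.Supplier;

final class SendLoopSketch {
    // Calls sendMap up to maxSends times and stops at the first null result.
    // Returns the number of sends that succeeded.
    static int sendInitialBatch(Supplier<Object> sendMap, int maxSends) {
        int sent = 0;
        for (int i = 0; i < maxSends; i++) {
            Object future = sendMap.get();
            if (future == null) {
                // sendMap failed; stop instead of retrying on a broken request.
                break;
            }
            sent++;
        }
        return sent;
    }
}
```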

Why were the override decorators removed from verifyRequest and 
getMapOutputInfo?  It's pretty important that those actually override a method.
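
To illustrate why the annotation matters, here is a small sketch with hypothetical classes (not the real ShuffleHandler code). Without @Override, a test subclass whose signature drifts from the base method compiles as an overload, and the real method still runs. With @Override, the same mistake is a compile error.

```java
class HandlerSketch {
    protected String verifyRequest(String appId) {
        return "real check for " + appId;
    }

    String handle(String appId) {
        return verifyRequest(appId);
    }
}

class TestHandlerSketch extends HandlerSketch {
    @Override // compiler guarantees this replaces the base method
    protected String verifyRequest(String appId) {
        return "skipped";
    }
}

class BrokenTestHandlerSketch extends HandlerSketch {
    // No @Override: the parameter type differs, so this is an overload.
    // handle() still dispatches to HandlerSketch.verifyRequest(String).
    protected String verifyRequest(Object appId) {
        return "skipped";
    }
}
```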

mockNetty is too monolithic like the old test and a bit unwieldy in that 
callers are expected to start mocking and then let mockNetty finish the job.  
The channel and message event mocking is just a few simple lines for each and 
would be fine to stay in the main test method.  Utility methods like 
createMockChannelFuture(channel, listenerList) and createMockHttpRequest() 
would help keep the original test method manageable in length and factor out 
some of the more complicated mocking of individual objects.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> ----------------------------------------------------------------
>                 Key: YARN-2410
>                 URL: https://issues.apache.org/jira/browse/YARN-2410
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Nathan Roberts
>            Assignee: Kuhu Shukla
>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, 
> YARN-2410-v6.patch, YARN-2410-v7.patch, YARN-2410-v8.patch
> The async nature of the shufflehandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in
> a single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file(). This will
> theoretically happen 6000*40=240000 times which will run the NM out of file
> descriptors and cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed. 
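
The deferred-open idea in the description can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch's implementation: a counter stands in for real file descriptors, and the class name, maxOpen cap, and callback name are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class LazyOpenSketch {
    private final Deque<String> pending;
    private final int maxOpen;
    private int open = 0;
    private int peakOpen = 0;

    LazyOpenSketch(List<String> mapOutputs, int maxOpen) {
        this.pending = new ArrayDeque<>(mapOutputs);
        this.maxOpen = maxOpen;
    }

    // Starts transfers only while under the cap; the rest stay queued unopened.
    void start() {
        while (open < maxOpen && startNext()) {
            // startNext does the work
        }
    }

    // "Opens" the next queued map output, if any. A real handler would open
    // the file here and begin the async transfer.
    private boolean startNext() {
        String next = pending.poll();
        if (next == null) {
            return false;
        }
        open++;
        peakOpen = Math.max(peakOpen, open);
        return true;
    }

    // Called when one transfer finishes: release its descriptor, then open
    // the next queued output so the cap is never exceeded.
    void onTransferComplete() {
        if (open == 0) {
            return;
        }
        open--;
        startNext();
    }

    int openCount() { return open; }

    int peakOpen() { return peakOpen; }

    int pendingCount() { return pending.size(); }
}
```

With 40 queued map outputs and a cap of 5, at most 5 descriptors are held per request at any time, rather than all 40.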

This message was sent by Atlassian JIRA
