[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-09 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v8.patch

Correcting 80-character line limit violations in the test.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, 
> YARN-2410-v6.patch, YARN-2410-v7.patch, YARN-2410-v8.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.
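A minimal sketch of the failure mode and the proposed direction, under stated assumptions (class and method names below are illustrative, not the actual ShuffleHandler code): if every requested map output is opened up front and handed to an asynchronous transfer, the descriptors stay open until each transfer finishes, so 6000 reducers x 40 outputs can hold on the order of 240,000 descriptors at once; opening lazily keeps only the in-flight file open.

{code:java}
// Hypothetical illustration only; not the real ShuffleHandler. It contrasts
// eagerly opening every requested map output (descriptors pile up while the
// async transfers are pending) with opening each file only when it is about
// to be sent.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class FdUsageSketch {

  // Problematic pattern: one open descriptor per queued output, all at once.
  static Queue<RandomAccessFile> openAllUpFront(List<String> paths) throws IOException {
    Queue<RandomAccessFile> pending = new ArrayDeque<>();
    for (String p : paths) {
      pending.add(new RandomAccessFile(p, "r")); // fd held until its transfer completes
    }
    return pending;
  }

  // Proposed direction: keep only the paths queued and open a file when it is
  // actually about to be sent, closing it before moving to the next one.
  static void sendLazily(List<String> paths) throws IOException {
    for (String p : paths) {
      try (RandomAccessFile spill = new RandomAccessFile(p, "r")) {
        // send the relevant partition of 'spill' here; in the real handler this
        // would be an async transfer and the next open would happen on its completion.
      }
    }
  }
}
{code}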





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-09 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v11.patch

Adding documentation comments.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch, 
> YARN-2410-v11.patch, YARN-2410-v2.patch, YARN-2410-v3.patch, 
> YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch, 
> YARN-2410-v7.patch, YARN-2410-v8.patch, YARN-2410-v9.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-09 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: (was: YARN-2410-v11.patch)

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch, 
> YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch, 
> YARN-2410-v5.patch, YARN-2410-v6.patch, YARN-2410-v7.patch, 
> YARN-2410-v8.patch, YARN-2410-v9.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-09 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v11.patch

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch, 
> YARN-2410-v11.patch, YARN-2410-v2.patch, YARN-2410-v3.patch, 
> YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch, 
> YARN-2410-v7.patch, YARN-2410-v8.patch, YARN-2410-v9.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-09 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v10.patch

Fixing whitespace and checkstyle issues.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch, 
> YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch, 
> YARN-2410-v5.patch, YARN-2410-v6.patch, YARN-2410-v7.patch, 
> YARN-2410-v8.patch, YARN-2410-v9.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-09 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v9.patch

sendMap should take only reduceContext as an argument. The test has been
refactored to use helper methods.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, 
> YARN-2410-v6.patch, YARN-2410-v7.patch, YARN-2410-v8.patch, YARN-2410-v9.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-08 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v7.patch

Modified ShuffleHandler to not use channel attachments. Moved MockNetty code to 
a helper method.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, 
> YARN-2410-v6.patch, YARN-2410-v7.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-06 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v6.patch

Thank you so much [~jlowe] for the detailed feedback. I have made all but two
of the changes and would like your further comments on those.

{quote}
Actually I'm not really sure why SendMapOutputParams exists separate from 
ReduceContext. There should be a one-to-one relationship there. 
{quote}

I totally agree. The only reason was findbugs, which does not allow more than 7
parameters in a method (or in the constructor that would populate these
values). If this is not an issue, I can move them into a single class. For now
I have made SendMapOutputParams an inner class of ReduceContext.

{quote}
Why was reduceContext added as a TestShuffleHandler instance variable? It's 
specific to the new test.
{quote}

The reduceContext variable holds the value set by the setAttachment() method
and is used by the getAttachment() answer. If I declare it in the test method,
it would need to be final, which cannot be done because the setter modifies it.
I am looking for another way. Let me know what you think.
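For reference, a rough sketch of that mock wiring (simplified, with illustrative class and method names; the real test mocks more of the Netty pipeline): the stubbed setAttachment() writes into a field on the test class and the stubbed getAttachment() reads it back, which is why the field currently lives on the class rather than inside the test method.

{code:java}
// Simplified sketch of the mock attachment handling under discussion; not the
// exact test code. The setAttachment() stub stores the value in an instance
// field so the getAttachment() stub can return it later.
import static org.mockito.Mockito.doAnswer;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.jboss.netty.channel.Channel;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

public class AttachmentMockSketch {

  // Held on the class (not the test method) because the stub below mutates it.
  private Object attachment;

  Channel mockChannelWithAttachment() {
    Channel ch = mock(Channel.class);
    doAnswer(new Answer<Void>() {
      @Override
      public Void answer(InvocationOnMock invocation) {
        attachment = invocation.getArguments()[0]; // remember what the handler attached
        return null;
      }
    }).when(ch).setAttachment(org.mockito.Matchers.any());
    when(ch.getAttachment()).thenAnswer(new Answer<Object>() {
      @Override
      public Object answer(InvocationOnMock invocation) {
        return attachment; // hand the stored value back to the handler
      }
    });
    return ch;
  }
}
{code}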

All other items have been done. 

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-08-07 Thread Rohith Sharma K S (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith Sharma K S updated YARN-2410:

Target Version/s: 2.7.2
   Fix Version/s: (was: 2.7.2)

Updating the Target Version field to 2.7.2. The Fix Version is set when the
issue is committed.

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla
 Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
 YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch


 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-08-06 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v5.patch

This is the latest revised patch. A messageReceived() call uses two counters, 
mapsToWait and mapsToSend, within the ReduceContext class to throttle the 
number of sendMapOutput() calls. Due to the asynchronous nature of Netty, these 
counters are atomic. A revised test case that mocks the Netty operations is also 
included.

Every completed I/O operation started by sendMapOutput() will start another until the 
entire mapIds list for a given request is processed.
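A rough sketch of that throttling idea, with illustrative names and details (the actual patch differs): ReduceContext carries the request's map IDs plus the two atomic counters, each transfer completion pulls the next index and starts at most one more send, and the last completion tells the handler the request is done.

{code:java}
// Illustrative sketch of the counter-based throttling described above; not the
// exact patch. mapsToSend hands out the next map index to send, mapsToWait
// counts completions so the channel can be finished once all are done.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ReduceContextSketch {
  final List<String> mapIds;
  final AtomicInteger mapsToSend;  // next map-output index to start sending
  final AtomicInteger mapsToWait;  // map outputs whose transfer has not completed yet

  ReduceContextSketch(List<String> mapIds) {
    this.mapIds = mapIds;
    this.mapsToSend = new AtomicInteger(0);
    this.mapsToWait = new AtomicInteger(mapIds.size());
  }

  // Called when the request arrives and again from each transfer's completion
  // callback; atomics make this safe across Netty I/O threads.
  String nextMapToSend() {
    int next = mapsToSend.getAndIncrement();
    return next < mapIds.size() ? mapIds.get(next) : null;
  }

  // Returns true when the last outstanding transfer has completed.
  boolean transferCompleted() {
    return mapsToWait.decrementAndGet() == 0;
  }
}
{code}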

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla
 Fix For: 2.7.2

 Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
 YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch


 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-07-26 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v4.patch

Revamped patch that uses a Map to store the number of open files per reduceId 
and passes the updated open-file count through the channel as an attachment. 
The number of files that can be open per reducer is configurable.
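Roughly, the bookkeeping this describes could look like the following (a sketch with made-up names, not the patch itself): a concurrent map keyed by reduceId tracks how many of that reducer's files are currently open, and another send is started only while the count stays under the configured cap.

{code:java}
// Illustrative sketch only. Tracks open map-output files per reduceId and
// checks a configurable cap before allowing another file to be opened.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class OpenFilesPerReducerSketch {
  private final int maxOpenFilesPerReducer;               // from configuration
  private final ConcurrentMap<String, AtomicInteger> openFiles =
      new ConcurrentHashMap<>();

  OpenFilesPerReducerSketch(int maxOpenFilesPerReducer) {
    this.maxOpenFilesPerReducer = maxOpenFilesPerReducer;
  }

  // Returns true if this reducer may open one more file; increments the count.
  boolean tryAcquire(String reduceId) {
    AtomicInteger count =
        openFiles.computeIfAbsent(reduceId, k -> new AtomicInteger(0));
    if (count.incrementAndGet() <= maxOpenFilesPerReducer) {
      return true;
    }
    count.decrementAndGet(); // over the cap; undo and make the caller wait
    return false;
  }

  // Called when a transfer completes and its file descriptor is closed.
  void release(String reduceId) {
    AtomicInteger count = openFiles.get(reduceId);
    if (count != null) {
      count.decrementAndGet();
    }
  }
}
{code}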

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla
 Fix For: 2.7.2

 Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
 YARN-2410-v3.patch, YARN-2410-v4.patch


 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-07-15 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v3.patch

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla
 Fix For: 2.7.2

 Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
 YARN-2410-v3.patch


 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-07-14 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v1.patch

ShuffleHandler's messageReceived() calls sendMapOutput() only if the number of 
open files for a given reduceId is within a configurable limit 
(mapreduce.shuffle.map.filecount). The count is incremented on each call to 
sendMapOutput(). The channel is closed once this limit is reached.
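As a concrete illustration of that flow (a sketch; the property key comes from the comment above, while the default value and class/method names are invented for the example): the limit is read from the configuration, sends are counted per reducer, and the caller closes the channel once the limit is hit.

{code:java}
// Sketch of the limit check described above; everything except the property
// key is illustrative, and the default of 40 is an assumption.
import org.apache.hadoop.conf.Configuration;

class MapFileCountLimitSketch {
  static final String SHUFFLE_MAP_FILECOUNT = "mapreduce.shuffle.map.filecount";

  private final int maxFilesPerReducer;
  private int sentForCurrentReducer = 0;

  MapFileCountLimitSketch(Configuration conf) {
    this.maxFilesPerReducer = conf.getInt(SHUFFLE_MAP_FILECOUNT, 40);
  }

  // Mirrors the decision in messageReceived(): send another map output only
  // while the per-reducer count is under the configured limit.
  boolean maySendAnother() {
    if (sentForCurrentReducer >= maxFilesPerReducer) {
      return false; // caller closes the channel at this point
    }
    sentForCurrentReducer++;
    return true;
  }
}
{code}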

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla
 Attachments: YARN-2410-v1.patch


 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-07-14 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v2.patch

Patch generated without the --no-prefix option, since git apply works without it.

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla
 Fix For: 2.7.2

 Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch


 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-07-07 Thread Chen He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen He updated YARN-2410:
--
Assignee: Kuhu Shukla  (was: Chen He)

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Kuhu Shukla

 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.





[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-05-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-2410:
-
Summary: Nodemanager ShuffleHandler can possible exhaust file descriptors  
(was: Nodemanager ShuffleHandler can easily exhaust file descriptors)

 Nodemanager ShuffleHandler can possible exhaust file descriptors
 

 Key: YARN-2410
 URL: https://issues.apache.org/jira/browse/YARN-2410
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nathan Roberts
Assignee: Chen He

 The async nature of the ShuffleHandler can cause it to open a huge number of
 file descriptors; when it runs out, it crashes.
 Scenario:
 Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
 Let's say all 6K reduces hit a node at about the same time asking for their
 outputs. Each reducer will ask for all 40 map outputs over a single socket in a
 single request (not necessarily all 40 at once, but with coalescing it is
 likely to be a large number).
 sendMapOutput() will open the file for random reading and then perform an
 async transfer of the particular portion of this file. This will theoretically
 happen 6000*40=240,000 times, which will run the NM out of file descriptors and
 cause it to crash.
 The algorithm should be refactored a little to not open the fds until they're
 actually needed.


