[jira] [Comment Edited] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933687#comment-16933687
 ] 

Chandni Singh edited comment on YARN-9839 at 9/19/19 7:03 PM:
--

The root cause of this issue was an OS-level configuration that did not let the 
OS overcommit virtual memory. The NM was not able to create more than 
800 threads because the kernel refused the vmem allocation.
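For reference, on a Linux NM host the relevant knob is typically the overcommit 
policy (this is an assumption; the ticket does not name the exact setting):

```shell
# Linux overcommit policy (assumed to be the setting involved here):
#   0 = heuristic overcommit (default), 1 = always overcommit,
#   2 = never overcommit -- thread stack reservations can then fail
#       with "unable to create new native thread"
cat /proc/sys/vm/overcommit_memory

# Relax the policy back to the heuristic default:
sudo sysctl -w vm.overcommit_memory=0
```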

However, the code here in {{ResourceLocalizationService}} is quite old. For 
every container localization request, this service creates a new 
{{LocalizerRunner}} native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pools, which reuse 
previously constructed threads when they are available and only create new 
ones when needed.
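As a minimal sketch of that idea (the class and method names below are 
hypothetical, not from the NM code), a cached pool reuses idle worker threads 
across localization requests instead of starting a native thread per request:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LocalizerPoolSketch {

    // Submits `requests` dummy localization tasks to a cached thread pool
    // and returns how many distinct worker threads actually ran them.
    // With a cached pool this is typically far fewer than `requests`,
    // because idle workers are reused rather than spawning a new native
    // thread for every request.
    static int runLocalizations(int requests) throws InterruptedException {
        Set<String> workers = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newCachedThreadPool();
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> workers.add(Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return workers.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("distinct workers for 100 requests: "
                + runLocalizations(100));
    }
}
```

A bounded pool ({{Executors.newFixedThreadPool(n)}}) would additionally cap the 
number of concurrent localizers, which a thread-per-request design cannot do.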

This class needs a refactoring and I would like to use this jira to do that.

cc. [~eyang] 





> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, a {{LocalizerRunner}} 
> thread is created, and each {{LocalizerRunner}} creates another thread to get 
> file permission info, which is where this failure comes from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}



--
This message was sent by Atlassian Jira.

[jira] [Comment Edited] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932017#comment-16932017
 ] 

Chandni Singh edited comment on YARN-9839 at 9/18/19 3:58 AM:
--

Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (i.e., a message is received from the respective ContainerLocalizer):
{code}
final Map<LocalResourceRequest, LocalizerResourceRequestEvent> scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List<LocalizerResourceRequestEvent> pending;
{code}

This code needs to be modified so that only the relevant information is cached, 
not the Thread itself.
Right now the {{Thread}} object persists in memory until the localization of 
the container is done, which can take much longer.
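One hedged sketch of this separation (all names below are illustrative, not the 
NM's actual API): keep the per-container bookkeeping in a plain state object 
keyed by container id, so worker threads can come and go while only the state 
stays cached until the ContainerLocalizer reports completion.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class LocalizerStateSketch {

    // Per-container bookkeeping, analogous to the scheduled/pending
    // fields above, but detached from any Thread object.
    static final class LocalizerState {
        // Shared between the localizer worker and the dispatcher thread.
        final List<String> pending = new CopyOnWriteArrayList<>();
        final Map<String, String> scheduled = new ConcurrentHashMap<>();
    }

    private final Map<String, LocalizerState> states =
            new ConcurrentHashMap<>();

    LocalizerState stateFor(String containerId) {
        return states.computeIfAbsent(containerId, id -> new LocalizerState());
    }

    // Invoked when localization of the container completes: only the
    // state object is released; no long-lived Thread was ever cached.
    void release(String containerId) {
        states.remove(containerId);
    }

    int trackedContainers() {
        return states.size();
    }
}
```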





> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, a {{LocalizerRunner}} 
> thread is created, and each {{LocalizerRunner}} creates another thread to get 
> file permission info, which is where this failure comes from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}
> {{LocalizerRunner}} are Threads which are cached in 
> {{ResourceLocalizationService}}. Looking into a possibility if they are not 
>