[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933687#comment-16933687
 ] 

Chandni Singh edited comment on YARN-9839 at 9/19/19 7:03 PM:
--------------------------------------------------------------

The root cause of this issue was an OS-level configuration that prevented the OS 
from overcommitting virtual memory. The NM was unable to create more than 
800 threads because the kernel refused the vmem allocation.
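For reference, these are the standard Linux knobs involved (the commands are generic; the right values are deployment-specific and not prescribed by this jira):

```shell
# Virtual memory overcommit policy: 0 = heuristic, 1 = always allow,
# 2 = never overcommit (strict accounting). A strict policy can make
# large thread counts fail even when ulimits look generous.
cat /proc/sys/vm/overcommit_memory

# Per-user process/thread limit that the "unable to create new native
# thread" error message points at.
ulimit -u
```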

However, the code here in {{ResourceLocalizationService}} is quite old. For 
every container localization request, this service creates a new 
{{LocalizerRunner}} native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pools, which reuse 
previously constructed threads when they are available and create new 
ones only when needed.
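A minimal sketch of the pooled alternative (hypothetical names, not the actual refactoring for this jira): tasks are submitted to a bounded pool that reuses its worker threads, instead of starting one native thread per request.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LocalizerPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // Instead of "new Thread(runner).start()" per localization request,
        // submit the work to a fixed pool: 100 tasks, at most 4 native threads.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < 100; i++) {
            pool.submit(completed::incrementAndGet); // stand-in for LocalizerRunner.run()
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("completed=" + completed.get());
    }
}
```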

This class needs refactoring, and I would like to use this jira to do that.

cc. [~eyang] 





> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -------------------------------------------------------------------------
>
>                 Key: YARN-9839
>                 URL: https://issues.apache.org/jira/browse/YARN-9839
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u <largerNum>' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
>         at org.apache.hadoop.util.Shell.run(Shell.java:482)
>         at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
>         at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, a {{LocalizerRunner}} thread is 
> created, and each {{LocalizerRunner}} creates another thread to read the 
> error stream for file permission info, which is where this failure occurs. 
> It is in Shell.java -> {{runCommand()}}:
> {code}
>     Thread errThread = new Thread() {
>       @Override
>       public void run() {
>         try {
>           String line = errReader.readLine();
>           while((line != null) && !isInterrupted()) {
>             errMsg.append(line);
>             errMsg.append(System.getProperty("line.separator"));
>             line = errReader.readLine();
>           }
>         } catch(IOException ioe) {
>           LOG.warn("Error reading the error stream", ioe);
>         }
>       }
>     };
> {code}
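If that per-command error-drain thread were the target, one option (a hypothetical sketch, not a committed patch) is to submit the drain task to a shared pool that reuses idle threads rather than allocating a fresh native thread per {{Shell}} invocation:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ErrDrainSketch {
    // Shared cached pool: idle workers are reused across Shell invocations.
    private static final ExecutorService DRAIN_POOL = Executors.newCachedThreadPool();

    // Drain the error stream on a pooled thread and return its contents.
    static Future<String> drain(BufferedReader errReader) {
        return DRAIN_POOL.submit(() -> {
            StringBuilder errMsg = new StringBuilder();
            String line;
            while ((line = errReader.readLine()) != null) {
                errMsg.append(line).append(System.lineSeparator());
            }
            return errMsg.toString();
        });
    }

    public static void main(String[] args)
            throws ExecutionException, InterruptedException {
        BufferedReader fake = new BufferedReader(new StringReader("err1\nerr2"));
        System.out.println(drain(fake).get().trim());
        DRAIN_POOL.shutdown();
    }
}
```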



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
