[ https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933687#comment-16933687 ]
Chandni Singh commented on YARN-9839:
-------------------------------------

The root cause of this issue was an OS-level configuration that was not letting the OS overcommit virtual memory. The NM was not able to create more than 800 threads because the kernel refused vmem allocation. However, the code here in {{ResourceLocalizationService}} is quite old. For every container localization request, this service creates a new {{LocalizerRunner}} native thread, which is expensive. It doesn't make use of {{ExecutorService}} or thread pools, which reuse previously constructed threads when they are available and create new ones only when needed. This class needs a refactoring, and I would like to use this jira to do that.

cc. [~eyang]

> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -------------------------------------------------------------------------
>
>                 Key: YARN-9839
>                 URL: https://issues.apache.org/jira/browse/YARN-9839
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught
> java.lang.OutOfMemoryError: unable to create new native thread. One possible
> reason is that ulimit setting of 'max user processes' is too low. If so, do
> 'ulimit -u <largerNum>' and try again.
> 2019-09-12 10:27:46,348 FATAL
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread
> Thread[LocalizerRunner for
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
>         at org.apache.hadoop.util.Shell.run(Shell.java:482)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
>         at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
>         at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
>         at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} thread created, and each {{LocalizerRunner}} creates another thread to get file permission info, which is where we see this failure from.
> It is in Shell.java -> {{runCommand()}}:
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
>     try {
>       String line = errReader.readLine();
>       while ((line != null) && !isInterrupted()) {
>         errMsg.append(line);
>         errMsg.append(System.getProperty("line.separator"));
>         line = errReader.readLine();
>       }
>     } catch (IOException ioe) {
>       LOG.warn("Error reading the error stream", ioe);
>     }
>   }
> };
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
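The refactoring suggested in the comment above — submitting localizer work to a shared pool instead of starting a fresh native thread per request — can be sketched as follows. This is a minimal, self-contained illustration, not Hadoop code: {{LocalizerPoolSketch}}, the pool size of 4, and the simulated tasks are all hypothetical; the point is only that an {{ExecutorService}} caps and reuses native threads where per-request {{new Thread(...).start()}} does not.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LocalizerPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // Bounded pool: at most 4 native threads ever exist, and they are
        // reused across all submitted tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Set<String> workers = ConcurrentHashMap.newKeySet();
        AtomicInteger completed = new AtomicInteger();

        // Simulate 100 localization requests. A per-request
        // "new Thread(...).start()" would cost 100 native threads;
        // the pool never creates more than 4.
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> {
                workers.add(Thread.currentThread().getName());
                completed.incrementAndGet();
            });
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("completed=" + completed.get());   // completed=100
        System.out.println("threads used=" + workers.size()); // at most 4
    }
}
```

The same idea applies to the stderr-draining thread in {{Shell.runCommand()}} quoted above: each exec currently spawns one more short-lived native thread, so under load the per-process thread count (and with strict overcommit settings, vmem for thread stacks) grows with the number of concurrent localizations rather than staying bounded.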