[jira] [Commented] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread
[ https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933687#comment-16933687 ]

Chandni Singh commented on YARN-9839:
-------------------------------------

The root cause of this issue was an OS-level configuration that prevented the OS from overcommitting virtual memory. The NM was not able to create more than 800 threads because the kernel refused the vmem allocation.

However, the code here in {{ResourceLocalizationService}} is quite old. For every container localization request, this service creates a new {{LocalizerRunner}} native thread, which is expensive. It doesn't make use of an {{ExecutorService}} or thread pool, which can reuse previously constructed threads when they are available and create new ones only when needed. This class needs refactoring, and I would like to use this jira to do that.

cc [~eyang]

> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -------------------------------------------------------------------------
>
>                 Key: YARN-9839
>                 URL: https://issues.apache.org/jira/browse/YARN-9839
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught java.lang.OutOfMemoryError: unable to create new native thread. One possible reason is that ulimit setting of 'max user processes' is too low. If so, do 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LocalizerRunner for container_e95_1568242982456_152026_01_000132,5,main] threw an Error. Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
>         at org.apache.hadoop.util.Shell.run(Shell.java:482)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
>         at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
>         at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
>         at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} thread created, and each {{LocalizerRunner}} creates another thread to get file permission info, which is where we see this failure.
> It is in Shell.java -> {{runCommand()}}
> {code}
>     Thread errThread = new Thread() {
>       @Override
>       public void run() {
>         try {
>           String line = errReader.readLine();
>           while((line != null) && !isInterrupted()) {
>             errMsg.append(line);
>             errMsg.append(System.getProperty("line.separator"));
>             line = errReader.readLine();
>           }
>         } catch(IOException ioe) {
>           LOG.warn("Error reading the error stream", ioe);
>         }
>       }
>     };
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
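The thread-reuse approach suggested in the comment above could look roughly like the following sketch. This is not the actual YARN code; the class and method names ({{LocalizerPoolSketch}}, {{localize}}) are hypothetical stand-ins. It shows a cached {{ExecutorService}} handling many localization requests, reusing idle worker threads instead of starting a fresh native thread per request.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch: per-request Thread creation replaced with a cached pool.
// A cached pool creates threads on demand and reuses idle ones (keep-alive
// 60s), so a burst of localization requests no longer means a burst of
// brand-new native threads.
public class LocalizerPoolSketch {
  private static final ExecutorService POOL = Executors.newCachedThreadPool();

  // Hypothetical stand-in for the work a LocalizerRunner performs.
  static void localize(String containerId, AtomicInteger done) {
    done.incrementAndGet();
  }

  public static int runLocalizations(int requests) {
    AtomicInteger done = new AtomicInteger();
    for (int i = 0; i < requests; i++) {
      final String containerId = "container_" + i;
      POOL.execute(() -> localize(containerId, done));
    }
    POOL.shutdown();
    try {
      POOL.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return done.get();
  }
}
```

With a cached pool the number of live threads tracks concurrent demand rather than total request count, which is exactly the property missing when a new {{Thread}} is started per container.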
[ https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933262#comment-16933262 ]

Steve Loughran commented on YARN-9839:
--------------------------------------

FYI, I'm adding some tests in HADOOP-16570 which verify that one of the FS clients doesn't leak threads: they cache the set of threads at the start and compare it against the set at the end, after filtering out some daemon threads which don't ever go away. The same trick might work here.
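The leak check Steve describes could be sketched as below. The real tests live in HADOOP-16570; the names here ({{ThreadLeakCheck}}, {{leakedThreads}}) are hypothetical. It snapshots the live non-daemon threads before running some work, snapshots them again afterwards, and reports any new threads that outlived the work.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hedged sketch of a thread-leak assertion: snapshot live threads before,
// snapshot after, and flag anything new. Daemon threads are filtered out
// because JVM housekeeping threads never go away.
public class ThreadLeakCheck {

  static Set<String> liveNonDaemonThreadNames() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> !t.isDaemon())
        .map(Thread::getName)
        .collect(Collectors.toSet());
  }

  /** Names of non-daemon threads created by {@code work} that outlived it. */
  static Set<String> leakedThreads(Runnable work) {
    Set<String> before = liveNonDaemonThreadNames();
    work.run();
    Set<String> after = new HashSet<>(liveNonDaemonThreadNames());
    after.removeAll(before);
    return after;
  }
}
```

A test built on this would wrap the localization path in {{leakedThreads}} and assert the returned set is empty once all containers have finished.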
[ https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932064#comment-16932064 ]

Chandni Singh commented on YARN-9839:
-------------------------------------

Another issue is that an error from the {{LocalizerRunner}} thread, which is created per container, causes the NM to fail. In the {{LocalizerRunner.run()}} method, if we don't want the NM to crash when localization fails (even when the failure is an OOM), we need to catch {{Throwable}} rather than just {{Exception}}, since {{OutOfMemoryError}} is an {{Error}}, not an {{Exception}}.

{code}
    try {
      // Get nmPrivateDir
      nmPrivateCTokensPath = dirsHandler.getLocalPathForWrite(
          NM_PRIVATE_DIR + Path.SEPARATOR + tokenFileName);

      // 0) init queue, etc.
      // 1) write credentials to private dir
      writeCredentials(nmPrivateCTokensPath);
      // 2) exec initApplication and wait
      if (dirsHandler.areDisksHealthy()) {
        exec.startLocalizer(new LocalizerStartContext.Builder()
            .setNmPrivateContainerTokens(nmPrivateCTokensPath)
            .setNmAddr(localizationServerAddress)
            .setUser(context.getUser())
            .setAppId(context.getContainerId()
                .getApplicationAttemptId().getApplicationId().toString())
            .setLocId(localizerId)
            .setDirsHandler(dirsHandler)
            .build());
      } else {
        throw new IOException("All disks failed. "
            + dirsHandler.getDisksHealthReport(false));
      }
      // TODO handle ExitCodeException separately?
    } catch (FSError fe) {
      exception = fe;
    } catch (Exception e) {
      exception = e;
    }
{code}
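The suggested widening of the catch clause can be reduced to a minimal sketch. This is not the actual {{ResourceLocalizationService}} code; {{CatchThrowableSketch}} and {{runLocalization}} are hypothetical names. The point is that {{catch (Exception e)}} misses {{Error}} subclasses like {{OutOfMemoryError}}, which then escape to the uncaught-exception handler that shuts the NM down; {{catch (Throwable t)}} records them instead.

```java
// Hedged sketch: widening the catch from Exception to Throwable so that an
// Error such as OutOfMemoryError is recorded for later handling instead of
// propagating to YarnUncaughtExceptionHandler (which shuts the NM down).
public class CatchThrowableSketch {

  static Throwable runLocalization(Runnable localizationWork) {
    Throwable exception = null;
    try {
      localizationWork.run();
    } catch (Throwable t) { // was: catch (Exception e) -- misses Error subclasses
      exception = t;
    }
    return exception;
  }
}
```

Whether swallowing an {{OutOfMemoryError}} is actually safe is a separate judgment call; the sketch only shows the mechanics of keeping the thread's failure local.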
[ https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932017#comment-16932017 ]

Chandni Singh commented on YARN-9839:
-------------------------------------

Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea. The intention behind caching it seems to be that the {{LocalizerRunner}} also holds data which can only be released once the container resources have been localized (that is, when the message is received from the respective ContainerLocalizer):

{code}
final Map scheduled;
// It's a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached, but only the relevant information is.
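The separation of cached state from the thread could be structured along these lines. All names here ({{LocalizerTracker}}, {{LocalizerState}}, {{String}}-keyed maps) are hypothetical simplifications, not the actual YARN types: the service caches a plain state object per localizer id, a short-lived worker thread only borrows a reference to it, and discarding the thread no longer discards the scheduled/pending bookkeeping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch: cache per-localizer *state*, not the Thread that works on it.
public class LocalizerTracker {

  // Hypothetical per-localizer state; stands in for the scheduled/pending
  // fields currently living inside the LocalizerRunner thread.
  static class LocalizerState {
    final Map<String, String> scheduled = new HashMap<>();
    final List<String> pending = Collections.synchronizedList(new ArrayList<>());
  }

  // State cached by the service, keyed by localizer id. The worker thread
  // (or pool task) looks its state up here instead of owning it.
  private final Map<String, LocalizerState> states = new HashMap<>();

  synchronized LocalizerState stateFor(String localizerId) {
    return states.computeIfAbsent(localizerId, id -> new LocalizerState());
  }

  synchronized void release(String localizerId) {
    states.remove(localizerId); // drop state once localization completes
  }

  synchronized int size() {
    return states.size();
  }
}
```

With this split, the execution side can be handed to a thread pool (as in the earlier comment) while the lifecycle of the cached state is tied to container localization, not to thread liveness.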