[
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chandni Singh updated YARN-9839:
--------------------------------
Comment: was deleted
(was: Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
It appears to be cached because the {{LocalizerRunner}} also holds the data
shown below, which can only be released once the container resources have been
localized (that is, once a message is received from the respective ContainerLocalizer):
{code}
final Map<LocalResourceRequest,LocalizerResourceRequestEvent> scheduled;
// It's a shared list between Private Localizer and dispatcher thread.
final List<LocalizerResourceRequestEvent> pending;
{code}
This code should be changed so that only the relevant data is cached, not the
{{Thread}} itself.
Right now the {{Thread}} object stays in memory until the localization of
the container is done, which can take a long time.
)
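The suggestion in the deleted comment, caching only the data and not the {{Thread}}, could look roughly like the sketch below. All names here ({{LocalizerState}}, {{LocalizerStateCache}}, {{startLocalizer}}, {{release}}) are hypothetical and are not Hadoop code, and {{String}} stands in for {{LocalResourceRequest}} and {{LocalizerResourceRequestEvent}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch; these classes are illustrative and are not Hadoop code.

/** Plain data holder standing in for the "scheduled" and "pending" fields. */
final class LocalizerState {
    // String stands in for LocalResourceRequest / LocalizerResourceRequestEvent.
    final Map<String, String> scheduled = new ConcurrentHashMap<>();
    final List<String> pending = Collections.synchronizedList(new ArrayList<>());
}

/** Caches only LocalizerState objects, never the Thread that works on them. */
final class LocalizerStateCache {
    private final Map<String, LocalizerState> states = new ConcurrentHashMap<>();

    /** Registers fresh state, then runs the work on a thread the cache does not keep. */
    Thread startLocalizer(String localizerId, Consumer<LocalizerState> work) {
        LocalizerState state = new LocalizerState();
        states.put(localizerId, state);
        Thread runner = new Thread(() -> work.accept(state),
                "LocalizerRunner for " + localizerId);
        runner.start();
        return runner; // handed back to the caller; not stored in the cache
    }

    LocalizerState get(String localizerId) {
        return states.get(localizerId);
    }

    /** Intended to be called once the localizer reports that it is done. */
    LocalizerState release(String localizerId) {
        return states.remove(localizerId);
    }

    int size() {
        return states.size();
    }
}
```

In this arrangement the runner thread can exit as soon as its work is done, while the state stays in the cache until {{release}} is called for that localizer.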
> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -------------------------------------------------------------------------
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Chandni Singh
> Assignee: Chandni Singh
> Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught
> java.lang.OutOfMemoryError: unable to create new native thread. One possible
> reason is that ulimit setting of 'max user processes' is too low. If so, do
> 'ulimit -u <largerNum>' and try again.
> 2019-09-12 10:27:46,348 FATAL
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread
> Thread[LocalizerRunner for
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, a {{LocalizerRunner}} thread is
> created, and each {{LocalizerRunner}} starts another thread while fetching
> file permission info, which is where this failure comes from. That thread is
> started in {{runCommand()}} in Shell.java:
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
>     try {
>       String line = errReader.readLine();
>       while((line != null) && !isInterrupted()) {
>         errMsg.append(line);
>         errMsg.append(System.getProperty("line.separator"));
>         line = errReader.readLine();
>       }
>     } catch(IOException ioe) {
>       LOG.warn("Error reading the error stream", ioe);
>     }
>   }
> };
> {code}
> {{LocalizerRunner}} instances are Threads that are cached in
> {{ResourceLocalizationService}}. I am looking into whether they are failing
> to be removed from the cache.
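One way to check whether {{LocalizerRunner}} threads are piling up is to count live threads by name from inside the JVM. The sketch below is only a diagnostic aid under that assumption: the {{ThreadCensus}} class and {{countByPrefix}} method are hypothetical, and the "LocalizerRunner for" prefix is taken from the thread name in the log above. It relies only on the standard {{Thread.getAllStackTraces()}} call.

```java
// Hypothetical diagnostic helper; not part of Hadoop.
final class ThreadCensus {

    /** Counts live threads in this JVM whose name starts with the given prefix. */
    static long countByPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) {
        // The prefix matches the thread name seen in the NodeManager log.
        long runners = countByPrefix("LocalizerRunner for");
        System.out.println("Live LocalizerRunner threads: " + runners);
    }
}
```

A count that keeps growing while few containers are actually localizing would be consistent with runners not being cleaned up, although it would not prove it on its own.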
--
This message was sent by Atlassian Jira
(v8.3.4#803005)