[ https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932064#comment-16932064 ]

Chandni Singh commented on YARN-9839:
-------------------------------------

Another issue is that an error thrown from the {{LocalizerRunner}} thread, 
which is created per container, causes the NM to fail. 
In the {{LocalizerRunner#run()}} method, if we don't want the NM to crash 
when localization fails (even when the cause is an {{OutOfMemoryError}}), we 
need to catch {{Throwable}}, which covers {{Error}}, instead of only 
{{Exception}}.

 {code}
      try {
        // Get nmPrivateDir
        nmPrivateCTokensPath = dirsHandler.getLocalPathForWrite(
                NM_PRIVATE_DIR + Path.SEPARATOR + tokenFileName);

        // 0) init queue, etc.
        // 1) write credentials to private dir
        writeCredentials(nmPrivateCTokensPath);
        // 2) exec initApplication and wait
        if (dirsHandler.areDisksHealthy()) {
          exec.startLocalizer(new LocalizerStartContext.Builder()
              .setNmPrivateContainerTokens(nmPrivateCTokensPath)
              .setNmAddr(localizationServerAddress)
              .setUser(context.getUser())
              .setAppId(context.getContainerId()
                  .getApplicationAttemptId().getApplicationId().toString())
              .setLocId(localizerId)
              .setDirsHandler(dirsHandler)
              .build());
        } else {
          throw new IOException("All disks failed. "
              + dirsHandler.getDisksHealthReport(false));
        }
      // TODO handle ExitCodeException separately?
      } catch (FSError fe) {
        exception = fe;
      } catch (Exception e) {
        exception = e;
      } 
{code}
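The {{catch}} clauses above match {{FSError}} and {{Exception}}, but an {{OutOfMemoryError}} is an {{Error}}, a sibling of {{Exception}} under {{Throwable}}, so it escapes to the {{YarnUncaughtExceptionHandler}}, which shuts the NM down. A minimal, self-contained sketch of that hierarchy behavior (class and method names here are illustrative, not from the Hadoop codebase):

{code}
// Illustrative only: shows why "catch (Exception e)" never sees an
// OutOfMemoryError. Error and Exception are siblings under Throwable, so
// only a catch (Throwable t) clause, or the JVM's uncaught-exception
// handler, observes an Error.
public class CatchDemo {
    // Returns which catch clause handled a simulated OOM.
    static String classify() {
        try {
            try {
                throw new OutOfMemoryError(
                    "simulated: unable to create new native thread");
            } catch (Exception e) {
                return "Exception"; // unreachable: an Error is not an Exception
            }
        } catch (Throwable t) {
            return "Throwable";
        }
    }

    public static void main(String[] args) {
        System.out.println("caught as: " + classify());
    }
}
{code}

Applied to the snippet above, widening the last clause to {{catch (Throwable t)}} would keep the OOM inside the {{LocalizerRunner}} instead of letting it reach the uncaught-exception handler.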

> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -------------------------------------------------------------------------
>
>                 Key: YARN-9839
>                 URL: https://issues.apache.org/jira/browse/YARN-9839
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u <largerNum>' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
>         at org.apache.hadoop.util.Shell.run(Shell.java:482)
>         at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
>         at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, a {{LocalizerRunner}} thread is 
> created, and each {{LocalizerRunner}} creates another thread to read the 
> file permission info, which is where this failure appears: in 
> {{Shell.runCommand()}}.
> {code}
>     Thread errThread = new Thread() {
>       @Override
>       public void run() {
>         try {
>           String line = errReader.readLine();
>           while((line != null) && !isInterrupted()) {
>             errMsg.append(line);
>             errMsg.append(System.getProperty("line.separator"));
>             line = errReader.readLine();
>           }
>         } catch(IOException ioe) {
>           LOG.warn("Error reading the error stream", ioe);
>         }
>       }
>     };
> {code}
> {{LocalizerRunner}} instances are threads that are cached in 
> {{ResourceLocalizationService}}. Looking into the possibility that they are 
> not getting removed from the cache.
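The {{Shell.runCommand()}} pattern quoted above, a dedicated thread draining the child process's stderr, can be sketched in isolation as follows (the class and method names are illustrative, not Hadoop's; it is that extra {{Thread.start()}} per command that surfaced the OOM):

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class StderrDrainDemo {
    // Mirrors the Shell.runCommand pattern: a dedicated thread drains the
    // child's stderr so the child cannot block on a full stderr pipe.
    static String runAndCaptureStderr(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).start();
        StringBuilder err = new StringBuilder();
        Thread errThread = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getErrorStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    err.append(line).append(System.lineSeparator());
                }
            } catch (IOException ignored) {
                // stream closed when the child exits
            }
        });
        errThread.start(); // this Thread.start() is where the OOM surfaced
        p.waitFor();
        errThread.join();
        return err.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(runAndCaptureStderr("sh", "-c", "echo oops 1>&2"));
    }
}
{code}

Since every localization request runs such commands, each leaked {{LocalizerRunner}} multiplies the number of native threads the NM needs.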



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
