[jira] [Commented] (YARN-10116) Expose diagnostics in RMAppManager summary

2020-02-04 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030291#comment-17030291
 ] 

Chandni Singh commented on YARN-10116:
--

The patch looks good to me. The test failure is unrelated to this patch; I was 
able to run the tests locally. [~jhung], please fix the checkstyle issues.

 

> Expose diagnostics in RMAppManager summary
> --
>
> Key: YARN-10116
> URL: https://issues.apache.org/jira/browse/YARN-10116
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-10116.001.patch
>
>
> It's useful for tracking app diagnostics.






[jira] [Commented] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933687#comment-16933687
 ] 

Chandni Singh commented on YARN-9839:
-

The root cause of this issue was an OS-level configuration that was not 
letting the OS overcommit virtual memory. The NM was not able to create more 
than 800 threads because the kernel refused the vmem allocation.

However, the code here in {{ResourceLocalizationService}} is quite old. For 
every container localization request, this service creates a new 
{{LocalizerRunner}} native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pool, which can reuse 
previously constructed threads when they are available and create new ones 
only when needed.

This class needs refactoring, and I would like to use this jira to do that.
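
For reference, a minimal sketch of that direction (the pool configuration and 
the {{localizerRunnerTask}} wrapper here are assumptions for illustration, not 
the actual change):
{code}
// Guava's ThreadFactoryBuilder just names the pool threads. A cached pool
// reuses previously constructed threads when they are available and creates
// new ones only when needed, instead of one native thread per request.
ExecutorService localizerPool = Executors.newCachedThreadPool(
    new ThreadFactoryBuilder()
        .setNameFormat("localizer-runner-%d")
        .setDaemon(true)
        .build());

// Hypothetical: submit the localization work as a task instead of calling
// new LocalizerRunner(...).start() for every container.
localizerPool.execute(localizerRunnerTask);
{code}
A bounded {{ThreadPoolExecutor}} would also let us cap the number of 
concurrent localizers instead of being limited only by what the kernel allows.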

cc. [~eyang] 


> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}






[jira] [Comment Edited] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933687#comment-16933687
 ] 

Chandni Singh edited comment on YARN-9839 at 9/19/19 7:03 PM:
--

The root cause of this issue was an OS-level configuration that was not 
letting the OS overcommit virtual memory. The NM was not able to create more 
than 800 threads because the kernel refused the vmem allocation.

However, the code here in {{ResourceLocalizationService}} is quite old. For 
every container localization request, this service creates a new 
{{LocalizerRunner}} native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pool, which can reuse 
previously constructed threads when they are available and create new ones 
only when needed.

This class needs refactoring, and I would like to use this jira to do that.

cc. [~eyang] 



was (Author: csingh):
The root cause of this issue was an OS-level configuration that was not 
letting the OS overcommit virtual memory. The NM was not able to create more 
than 800 threads because the kernel refused the vmem allocation.

However, the code here in {{ResourceLocalizationService}} is quite old. For 
every container localization request, this service creates a new 
{{LocalizerRunner}} native thread, which is expensive.

It doesn't make use of an {{ExecutorService}} or thread pool, which can reuse 
previously constructed threads when they are available and create new ones 
only when needed.

This class needs refactoring, and I would like to use this jira to do that.

cc. [~eyang] 


> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}




[jira] [Updated] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-19 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9839:

Description: 
NM fails with the below error even though the ulimit for NM is large.

{code}
2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
java.lang.OutOfMemoryError: unable to create new native thread. One possible 
reason is that ulimit setting of 'max user processes' is too low. If so, do 
'ulimit -u ' and try again.
2019-09-12 10:27:46,348 FATAL 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for container_e95_1568242982456_152026_01_000132,5,main] 
threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
{code}

For each container localization request, there is a {{LocalizerRunner}} thread 
created and each {{LocalizerRunner}} creates another thread to get file 
permission info which is where we see this failure from. It is in Shell.java -> 
{{runCommand()}}

{code}
Thread errThread = new Thread() {
  @Override
  public void run() {
try {
  String line = errReader.readLine();
  while((line != null) && !isInterrupted()) {
errMsg.append(line);
errMsg.append(System.getProperty("line.separator"));
line = errReader.readLine();
  }
} catch(IOException ioe) {
  LOG.warn("Error reading the error stream", ioe);
}
  }
};
{code}






  was:
NM fails with the below error even though the ulimit for NM is large.

{code}
2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
java.lang.OutOfMemoryError: unable to create new native thread. One possible 
reason is that ulimit setting of 'max user processes' is too low. If so, do 
'ulimit -u ' and try again.
2019-09-12 10:27:46,348 FATAL 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for container_e95_1568242982456_152026_01_000132,5,main] 
threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
at 

[jira] [Commented] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932064#comment-16932064
 ] 

Chandni Singh commented on YARN-9839:
-

Another issue is that an error from the {{LocalizerRunner}} thread, which is 
created per container, causes the whole NM to fail.
In the {{LocalizerRunner -> run()}} method, if we don't want the NM to crash 
because localization is failing (even when it is an OOM), we need to catch 
{{Throwable}} rather than just {{Exception}}; a sketch of that change follows 
the snippet below.

 {code}
  try {
// Get nmPrivateDir
nmPrivateCTokensPath = dirsHandler.getLocalPathForWrite(
NM_PRIVATE_DIR + Path.SEPARATOR + tokenFileName);

// 0) init queue, etc.
// 1) write credentials to private dir
writeCredentials(nmPrivateCTokensPath);
// 2) exec initApplication and wait
if (dirsHandler.areDisksHealthy()) {
  exec.startLocalizer(new LocalizerStartContext.Builder()
  .setNmPrivateContainerTokens(nmPrivateCTokensPath)
  .setNmAddr(localizationServerAddress)
  .setUser(context.getUser())
  .setAppId(context.getContainerId()
  .getApplicationAttemptId().getApplicationId().toString())
  .setLocId(localizerId)
  .setDirsHandler(dirsHandler)
  .build());
} else {
  throw new IOException("All disks failed. "
  + dirsHandler.getDisksHealthReport(false));
}
  // TODO handle ExitCodeException separately?
  } catch (FSError fe) {
exception = fe;
  } catch (Exception e) {
exception = e;
  } 
{code}
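
Something like this, assuming {{exception}} is declared as a {{Throwable}} (a 
sketch, not the committed fix):
{code}
  } catch (FSError fe) {
    exception = fe;
  } catch (Throwable t) {
    // Also swallow Errors such as OutOfMemoryError, so a failed localization
    // only fails this container instead of reaching the
    // YarnUncaughtExceptionHandler and shutting down the whole NM.
    exception = t;
  }
{code}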

> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   

[jira] [Issue Comment Deleted] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9839:

Comment: was deleted

(was: Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (a message is received from the respective ContainerLocalizer):
{code}
final Map scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached; only 
the relevant information should be cached.
Right now the {{Thread}} object persists in memory until the localization of 
the container is done, which can take much longer.
)

> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}
> {{LocalizerRunner}} are Threads which are cached in 
> {{ResourceLocalizationService}}. Looking into a possibility if they are not 
> getting removed from the cache.






[jira] [Comment Edited] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932017#comment-16932017
 ] 

Chandni Singh edited comment on YARN-9839 at 9/18/19 3:58 AM:
--

Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (a message is received from the respective ContainerLocalizer):
{code}
final Map scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached; only 
the relevant information should be cached.
Right now the {{Thread}} object persists in memory until the localization of 
the container is done, which can take much longer.
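
A minimal sketch of that direction (the {{LocalizerState}} holder and the 
generic types shown are assumptions for illustration):
{code}
// Cache only the localizer's bookkeeping, not the Thread itself.
class LocalizerState {
  final Map<LocalResourceRequest, LocalizerResourceRequestEvent> scheduled =
      new HashMap<>();
  // Shared between the localizer and the dispatcher thread.
  final List<LocalizerResourceRequestEvent> pending =
      Collections.synchronizedList(new ArrayList<>());
}

// The service would then map localizerId -> LocalizerState instead of
// localizerId -> LocalizerRunner (a Thread), and borrow a worker thread from
// a pool only while the localization is actually running.
Map<String, LocalizerState> localizerStates = new ConcurrentHashMap<>();
{code}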



was (Author: csingh):
Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (a message is received from the respective ContainerLocalizer):
{code}
final Map scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached; only 
the relevant information should be cached.
Right now the {{Thread}} object persists in memory until the localization of 
the container is done, which can take much longer.


> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}
> 

[jira] [Comment Edited] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932017#comment-16932017
 ] 

Chandni Singh edited comment on YARN-9839 at 9/18/19 3:51 AM:
--

Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (a message is received from the respective ContainerLocalizer):
{code}
final Map scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached; only 
the relevant information should be cached.
Right now the {{Thread}} object persists in memory until the localization of 
the container is done, which can take much longer.



was (Author: csingh):
Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (a message is received from the respective ContainerLocalizer):
{code}
final Map scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached; only 
the relevant information should be cached.



> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}
> {{LocalizerRunner}} are Threads which are cached in 
> {{ResourceLocalizationService}}. Looking into a possibility if they are not 
> 

[jira] [Commented] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932017#comment-16932017
 ] 

Chandni Singh commented on YARN-9839:
-

Caching {{LocalizerRunner}}, which is a {{Thread}}, is not a good idea.
The intention behind caching it seems to be that the {{LocalizerRunner}} also 
holds data which can only be released once the container resources have been 
localized (a message is received from the respective ContainerLocalizer):
{code}
final Map scheduled;
// Its a shared list between Private Localizer and dispatcher thread.
final List pending;
{code}

This code needs to be modified so that the Thread itself is not cached; only 
the relevant information should be cached.



> NodeManager java.lang.OutOfMemoryError unable to create new native thread
> -
>
> Key: YARN-9839
> URL: https://issues.apache.org/jira/browse/YARN-9839
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> NM fails with the below error even though the ulimit for NM is large.
> {code}
> 2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
> java.lang.OutOfMemoryError: unable to create new native thread. One possible 
> reason is that ulimit setting of 'max user processes' is too low. If so, do 
> 'ulimit -u ' and try again.
> 2019-09-12 10:27:46,348 FATAL 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[LocalizerRunner for 
> container_e95_1568242982456_152026_01_000132,5,main] threw an Error.  
> Shutting down now...
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
> at org.apache.hadoop.util.Shell.run(Shell.java:482)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
> at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> {code}
> For each container localization request, there is a {{LocalizerRunner}} 
> thread created and each {{LocalizerRunner}} creates another thread to get 
> file permission info which is where we see this failure from. It is in 
> Shell.java -> {{runCommand()}}
> {code}
> Thread errThread = new Thread() {
>   @Override
>   public void run() {
> try {
>   String line = errReader.readLine();
>   while((line != null) && !isInterrupted()) {
> errMsg.append(line);
> errMsg.append(System.getProperty("line.separator"));
> line = errReader.readLine();
>   }
> } catch(IOException ioe) {
>   LOG.warn("Error reading the error stream", ioe);
> }
>   }
> };
> {code}
> {{LocalizerRunner}} are Threads which are cached in 
> {{ResourceLocalizationService}}. Looking into a possibility if they are not 
> getting removed from the cache.






[jira] [Created] (YARN-9839) NodeManager java.lang.OutOfMemoryError unable to create new native thread

2019-09-17 Thread Chandni Singh (Jira)
Chandni Singh created YARN-9839:
---

 Summary: NodeManager java.lang.OutOfMemoryError unable to create 
new native thread
 Key: YARN-9839
 URL: https://issues.apache.org/jira/browse/YARN-9839
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chandni Singh
Assignee: Chandni Singh


NM fails with the below error even though the ulimit for NM is large.

{code}
2019-09-12 10:27:46,348 ERROR org.apache.hadoop.util.Shell: Caught 
java.lang.OutOfMemoryError: unable to create new native thread. One possible 
reason is that ulimit setting of 'max user processes' is too low. If so, do 
'ulimit -u ' and try again.
2019-09-12 10:27:46,348 FATAL 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[LocalizerRunner for container_e95_1568242982456_152026_01_000132,5,main] 
threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:562)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:659)
at 
org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:634)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1441)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1405)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$800(ResourceLocalizationService.java:140)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
{code}

For each container localization request, there is a {{LocalizerRunner}} thread 
created and each {{LocalizerRunner}} creates another thread to get file 
permission info which is where we see this failure from. It is in Shell.java -> 
{{runCommand()}}

{code}
Thread errThread = new Thread() {
  @Override
  public void run() {
try {
  String line = errReader.readLine();
  while((line != null) && !isInterrupted()) {
errMsg.append(line);
errMsg.append(System.getProperty("line.separator"));
line = errReader.readLine();
  }
} catch(IOException ioe) {
  LOG.warn("Error reading the error stream", ioe);
}
  }
};
{code}

{{LocalizerRunner}}s are Threads which are cached in 
{{ResourceLocalizationService}}. Looking into the possibility that they are 
not getting removed from the cache.










[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2019-03-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803551#comment-16803551
 ] 

Chandni Singh commented on YARN-9292:
-

{quote} 
Real container id of the application master provides the already initialized 
path and .cmd file is stored in existing container directory. cmd file gets 
clean up when application is finished. Using randomly generated container id 
will not clean up as nicely.
{quote}
[~eyang] In patch 6, a random container id is already being created on the 
client side, which is the {{ServiceScheduler}}. It creates a container id 
from the appId and the current system time.

{code}
+  ContainerId cid = ContainerId
+  .newContainerId(ApplicationAttemptId.newInstance(appId, 1),
+  System.currentTimeMillis());
{code}
 
For images, we probably need to write the command file to a path that is 
independent of containers, under the nmPrivate directory. Our code can ensure 
that once the command is executed, the temp .cmd file is deleted.

I do think it is important that we don't expose this API with a 
container/container id in it, because there is no logical relation between the 
image and the container.
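
As an illustration of the container-independent path idea ({{nmPrivateDir}}, 
{{dockerCommand}}, and {{executeDockerCommand}} are placeholders, not the 
actual patch):
{code}
// Hypothetical: write the docker .cmd file under nmPrivate in an image-scoped,
// container-independent location and delete it once the command has run.
Path cmdDir = Paths.get(nmPrivateDir, "docker-images");      // java.nio.file
Files.createDirectories(cmdDir);
Path cmdFile = Files.createTempFile(cmdDir, "docker-images.", ".cmd");
try {
  Files.write(cmdFile, dockerCommand.getBytes(StandardCharsets.UTF_8));
  executeDockerCommand(cmdFile);          // placeholder for the executor call
} finally {
  Files.deleteIfExists(cmdFile);          // no per-container cleanup needed
}
{code}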

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.






[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2019-03-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803429#comment-16803429
 ] 

Chandni Singh commented on YARN-9292:
-

[~eyang] The REST API added here to find the image is independent of any 
container, so I don't think we should have the container and container id in 
the path.
{code}
  @Path("/container/{id}/docker/images/{name}")
{code}
If this is done because the {{DockerCommandExecutor}} needs a container id, we 
could change the implementation here to use a dummy container id. That 
implementation could be fixed later, but the REST API will not be affected and 
will remain unchanged.
{code}
 String output = DockerCommandExecutor.executeDockerCommand(
  dockerImagesCommand, id, null, privOpExecutor, false, nmContext);
{code}
We could generate a dummy container id here instead of doing it in every client.
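
Something along these lines, purely for illustration (the synthetic 
{{ApplicationId}} and attempt number are arbitrary):
{code}
// Hypothetical: build a synthetic ContainerId on the NM side so REST clients
// don't need to pass a real container id for image operations.
ContainerId dummy = ContainerId.newContainerId(
    ApplicationAttemptId.newInstance(
        ApplicationId.newInstance(System.currentTimeMillis(), 1), 1),
    System.currentTimeMillis());
{code}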

Some other nitpicks:

1. Log statements in {{ServiceScheduler}} can be parameterized, which improves 
readability:
{code}
  LOG.info("Docker image: " + id + " maps to: " + imageId); ->
 LOG.info("Docker image: {} maps to : {}", id, imageId);
{code}

2. There aren't any tests for the new code added to {{ServiceScheduler}}. 
Would it be possible to add one?

> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch, 
> YARN-9292.006.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.






[jira] [Commented] (YARN-9292) Implement logic to keep docker image consistent in application that uses :latest tag

2019-03-22 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799438#comment-16799438
 ] 

Chandni Singh commented on YARN-9292:
-

[~eyang] I have a hadoop-build-1000:latest image locally.
{code} docker images hadoop-build-1000:latest --format='{{json .}}' {code}
gives the info below:
{code} 
{"Containers":"N/A","CreatedAt":"2018-12-18 23:08:27 -0800 
PST","CreatedSince":"3 months 
ago","Digest":"\u003cnone\u003e","ID":"c9e7cc96aa61","Repository":"hadoop-build-1000","SharedSize":"N/A","Size":"2.01GB","Tag":"latest","UniqueSize":"N/A","VirtualSize":"2.013GB"}
{code}

However,
{code} docker image inspect hadoop-build-1000:latest --format={{.RepoDigests}}  
{code}
 doesn't return anything. 
The output of this command is 
{code}
[]
{code}



> Implement logic to keep docker image consistent in application that uses 
> :latest tag
> 
>
> Key: YARN-9292
> URL: https://issues.apache.org/jira/browse/YARN-9292
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9292.001.patch, YARN-9292.002.patch, 
> YARN-9292.003.patch, YARN-9292.004.patch, YARN-9292.005.patch
>
>
> Docker image with latest tag can run in YARN cluster without any validation 
> in node managers. If a image with latest tag is changed during containers 
> launch. It might produce inconsistent results between nodes. This is surfaced 
> toward end of development for YARN-9184 to keep docker image consistent 
> within a job. One of the ideas to keep :latest tag consistent for a job, is 
> to use docker image command to figure out the image id and use image id to 
> propagate to rest of the container requests. There are some challenges to 
> overcome:
>  # The latest tag does not exist on the node where first container starts. 
> The first container will need to download the latest image, and find image 
> ID. This can introduce lag time for other containers to start.
>  # If image id is used to start other container, container-executor may have 
> problems to check if the image is coming from a trusted source. Both image 
> name and ID must be supply through .cmd file to container-executor. However, 
> hacker can supply incorrect image id and defeat container-executor security 
> checks.
> If we can over come those challenges, it maybe possible to keep docker image 
> consistent with one application.






[jira] [Updated] (YARN-9378) Create Image Localizer

2019-03-21 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Description: 
Refer YARN-3854. 

Add a Docker image localizer. The image localizer is part of 
{{ResourceLocalizationService}} and serves the following purposes:

1. All image localization requests will be served by the image localizer.
2. The image localizer initially runs {{DockerImagesCommand}} to find all 
images on the local node.
3. For an image localization request, it executes {{DockerPullCommand}} if the 
image is not present on the local node.
4. It returns the status of image localization by periodically executing 
{{DockerImagesCommand}} for a particular image.

{{LinuxContainerExecutor}} is for container operations, whereas 
{{DockerImagesCommand}} is independent of any container. The image localizer 
acts as a service that localizes docker images and maintains an image cache. 
Other components can use it to query the images on the node.

  was:{{LinuxContainerExecutor}} is for container operations. 
DockerImagesCommand is independent of any container. The image localizer acts 
as a service that will localize docker images and maintain an image cache. 
Other components can use this to query about the images on the node.


> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9378.001.patch
>
>
> Refer YARN-3854. 
> Add Docker Image Localizer. The image localizer is part of 
> {{ResourceLocalizationService}}. It serves the following purpose:
> 1. All image localization requests will be served by image localizer.
> 2. Image localizer initially runs {{DockerImagesCommand}} to find all images 
> on the local node.
> 3. For an image localization request, it executes {{DockerPullCommand}} if 
> the image is not present on the local node.
> 4. It returns the status of image localization by periodically executing 
> {{DockerImagesCommand}} on a particular image. 
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. The image localizer acts as a service that 
> will localize docker images and maintain an image cache. Other components can 
> use this to query about the images on the node.
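
For reference, a rough skeleton of such an image localizer loop (the class and 
its {{listLocalImages}}/{{pullImage}} helpers are hypothetical stand-ins for 
running {{DockerImagesCommand}}/{{DockerPullCommand}}, not actual Hadoop APIs):
{code}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class DockerImageLocalizer implements Runnable {
  private final Set<String> localImages = ConcurrentHashMap.newKeySet();
  private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();

  public void requestImage(String image) {
    requests.add(image);
  }

  public boolean isLocalized(String image) {
    return localImages.contains(image);
  }

  @Override
  public void run() {
    localImages.addAll(listLocalImages());      // initial `docker images` scan
    while (!Thread.currentThread().isInterrupted()) {
      try {
        String image = requests.take();
        if (!localImages.contains(image)) {
          pullImage(image);                     // `docker pull <image>`
        }
        localImages.add(image);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  // Placeholders for the docker interactions run through the container executor.
  private Set<String> listLocalImages() { return Collections.emptySet(); }
  private void pullImage(String image) { /* docker pull via container-executor */ }
}
{code}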






[jira] [Resolved] (YARN-9249) Add support for docker image localization

2019-03-21 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh resolved YARN-9249.
-
Resolution: Duplicate

Duplicate of https://issues.apache.org/jira/browse/YARN-9378

> Add support for docker image localization
> -
>
> Key: YARN-9249
> URL: https://issues.apache.org/jira/browse/YARN-9249
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Refer YARN-3854. 
> Add Docker Image Localizer. The image localizer is part of 
> {{ResourceLocalizationService}}. It serves the following purpose:
> 1. All image localization requests will be served by image localizer.
> 2. Image localizer initially runs {{DockerImagesCommand}} to find all images 
> on the local node.
> 3. For an image localization request, it executes {{DockerPullCommand}} if 
> the image is not present on the local node.
> 4. It returns the status of image localization by periodically executing 
> {{DockerImagesCommand}} on a particular image. 






[jira] [Commented] (YARN-5670) Add support for Docker image clean up

2019-03-21 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797855#comment-16797855
 ] 

Chandni Singh commented on YARN-5670:
-

{quote}
it is possible that some images tracking are lost from LRU and result in 
dangling images over time. 
{quote}
The cache can be backed by the NMStateStore, so even when the NM comes back up 
or is restarted, it will know which images it localized.

The reason I am against using {{docker image prune}} is that there can be 
multiple images on the node which an admin may have pulled explicitly or which 
some other process may have downloaded. Even if those images haven't been used 
within the last {{24h}}, or whatever time we have configured for the NM, the 
NM should not be the one deciding to remove them. It is surprising for the 
admin/other process to find that an image they pulled has been mysteriously 
deleted.
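
A small sketch of such an NM-side image cache (a plain access-ordered 
{{LinkedHashMap}} LRU; {{maxCachedImages}}, {{removeImage}}, and the 
state-store call are placeholders):
{code}
// Tracks only the images that this NM localized; images pulled by admins or
// other processes never enter the map, so the NM never deletes them.
Map<String, Long> localizedImages =
    new LinkedHashMap<String, Long>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        if (size() > maxCachedImages) {
          removeImage(eldest.getKey());     // e.g. issue `docker rmi <image>`
          persistToNMStateStore();          // so the cache survives NM restart
          return true;
        }
        return false;
      }
    };
{code}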

> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: Localization Support For Docker Images_002.pdf
>
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.






[jira] [Commented] (YARN-5670) Add support for Docker image clean up

2019-03-20 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797675#comment-16797675
 ] 

Chandni Singh commented on YARN-5670:
-

[~eyang] I have attached the doc from YARN-3854 here as well, which discusses 
the two options:
 1. The NM maintains an LRU cache which holds the meta-info of docker images, 
and it only deletes the images that it pulled/localized.
 2. Use {{docker image prune}}.

IMO option 1 is better because {{docker image prune}} will remove all the 
images on the node, even the ones the NM never localized.

 Also, [~ebadger] had a comment in favor of this option. The link to it is:
 
https://issues.apache.org/jira/browse/YARN-3854?focusedCommentId=16645496=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16645496

> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: Localization Support For Docker Images_002.pdf
>
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5670) Add support for Docker image clean up

2019-03-20 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5670:

Attachment: Localization Support For Docker Images_002.pdf

> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: Localization Support For Docker Images_002.pdf
>
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5670) Add support for Docker image clean up

2019-03-20 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5670:

Attachment: (was: 
YARN-3854_Localization_support_for_Docker_image_v3.pdf)

> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5670) Add support for Docker image clean up

2019-03-20 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5670:

Attachment: YARN-3854_Localization_support_for_Docker_image_v3.pdf

> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-3854_Localization_support_for_Docker_image_v3.pdf
>
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9378) Create Image Localizer

2019-03-13 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792055#comment-16792055
 ] 

Chandni Singh commented on YARN-9378:
-

{quote}
In this context, there is no container id, maybe we can fix 
preparePrivilegedOperation to accept null instead of passing system time to 
make this more readable.
{quote}
If I remember correctly, preparePrivilegedOperation requires that string to 
construct the filename of the docker command file. I can re-check. We can think 
of a better string to pass, possibly the "image name and the version". 
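
A hypothetical sketch of that suggestion, deriving a filename-safe identifier 
from the image reference (the helper name is made up):
{code:java}
// Turns an image reference into something safe to use in a command file name.
public final class DockerCommandFileIds {
  private DockerCommandFileIds() { }

  /** e.g. "library/centos:7" -> "library_centos_7". */
  public static String fromImage(String imageNameAndTag) {
    return imageNameAndTag.replaceAll("[^A-Za-z0-9.-]", "_");
  }

  public static void main(String[] args) {
    System.out.println(fromImage("library/centos:7"));
  }
}
{code}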

> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9378.001.patch
>
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. The image localizer acts as a service that 
> will localize docker images and maintain an image cache. Other components can 
> use this to query about the images on the node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9378) Create Image Localizer

2019-03-12 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790997#comment-16790997
 ] 

Chandni Singh commented on YARN-9378:
-

[~eyang] I attached the patch in which I created an ImageLocalizer that 
executes the DockerImagePull command.
The reason I didn't integrate the DockerImagePull command with 
LinuxContainerExecutor is that its API is tied to containers, while this 
command is independent of containers. 

Please take a look at the {{ImageLocalizer.java}} class and these methods:
- findLocalImages() 
- ImageStatusRetriever.run()

Let me know your thoughts. 
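
A very rough sketch of the flow only, using plain {{ProcessBuilder}} and 
hypothetical names (the patch itself presumably goes through the 
container-executor/PrivilegedOperation path instead):
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class ImageLocalizerSketch {

  /** Roughly what findLocalImages() does: list images known to the daemon. */
  static Set<String> findLocalImages() throws IOException, InterruptedException {
    Process p = new ProcessBuilder("docker", "images",
        "--format", "{{.Repository}}:{{.Tag}}").start();
    Set<String> images = new HashSet<>();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        images.add(line.trim());
      }
    }
    p.waitFor();
    return images;
  }

  /** Pull only when the image is not already present locally. */
  static void localize(String image) throws IOException, InterruptedException {
    if (!findLocalImages().contains(image)) {
      new ProcessBuilder("docker", "pull", image).inheritIO().start().waitFor();
    }
  }
}
{code}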


> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9378.001.patch
>
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. The image localizer acts as a service that 
> will localize docker images and maintain an image cache. Other components can 
> use this to query about the images on the node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Create Image Localizer

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Description: {{LinuxContainerExecutor}} is for container operations. 
DockerImagesCommand is independent of any container. The image localizer acts 
as a service that will localize docker images and maintain an image cache. 
Other components can use this to query about the images on the node.  (was: 
{{LinuxContainerExecutor}} is for container operations. DockerImagesCommand is 
independent of any container. The image localizer acts as a service that will 
localize docker images and maintain a cache.)

> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9378.001.patch
>
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. The image localizer acts as a service that 
> will localize docker images and maintain an image cache. Other components can 
> use this to query about the images on the node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Create Image Localizer

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Attachment: YARN-9378.001.patch

> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9378.001.patch
>
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. The image localizer acts as a service that 
> will localize docker images and maintain a cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Create Image Localizer

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Description: {{LinuxContainerExecutor}} is for container operations. 
DockerImagesCommand is independent of any container. The image localizer acts 
as a service that will localize docker images and maintain a cache.  (was: 
{{LinuxContainerExecutor}} is for container operations. DockerImagesCommand is 
independent of any container. I )

> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. The image localizer acts as a service that 
> will localize docker images and maintain a cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Create Image Localizer

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Summary: Create Image Localizer  (was: Use of DockerImagesCommand in 
ImageLocalizer)

> Create Image Localizer
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. I 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Use of DockerImagesCommand in ImageLocalizer

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Description: {{LinuxContainerExecutor}} is for container operations. 
DockerImagesCommand is independent of any container. I   (was: {{LinuxConta)

> Use of DockerImagesCommand in ImageLocalizer
> 
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> {{LinuxContainerExecutor}} is for container operations. DockerImagesCommand 
> is independent of any container. I 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Use of DockerImagesCommand in ImageLocalizer

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Summary: Use of DockerImagesCommand in ImageLocalizer  (was: Integrate 
Docker image command with the LinuxContainerExecutor)

> Use of DockerImagesCommand in ImageLocalizer
> 
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> {{LinuxConta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9378) Integrate Docker image command with the LinuxContainerExecutor

2019-03-12 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9378:

Description: {{LinuxConta

> Integrate Docker image command with the LinuxContainerExecutor
> --
>
> Key: YARN-9378
> URL: https://issues.apache.org/jira/browse/YARN-9378
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> {{LinuxConta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9378) Integrate Docker image command with the LinuxContainerExecutor

2019-03-11 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9378:
---

 Summary: Integrate Docker image command with the 
LinuxContainerExecutor
 Key: YARN-9378
 URL: https://issues.apache.org/jira/browse/YARN-9378
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chandni Singh
Assignee: Chandni Singh






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9249) Add support for docker image localization

2019-03-04 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9249:

Description: 
Refer YARN-3854. 

Add Docker Image Localizer. The image localizer is part of 
{{ResourceLocalizationService}}. It serves the following purpose:

1. All image localization requests will be served by image localizer.
2. Image localizer initially runs {{DockerImagesCommand}} to find all images on 
the local node.
3. For an image localization request, it executes {{DockerPullCommand}} if the 
image is not present on the local node.
4. It returns the status of image localization by periodically executing 
{{DockerImagesCommand}} on a particular image. 

  was:Refer YARN-3854.


> Add support for docker image localization
> -
>
> Key: YARN-9249
> URL: https://issues.apache.org/jira/browse/YARN-9249
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Refer YARN-3854. 
> Add Docker Image Localizer. The image localizer is part of 
> {{ResourceLocalizationService}}. It serves the following purpose:
> 1. All image localization requests will be served by image localizer.
> 2. Image localizer initially runs {{DockerImagesCommand}} to find all images 
> on the local node.
> 3. For an image localization request, it executes {{DockerPullCommand}} if 
> the image is not present on the local node.
> 4. It returns the status of image localization by periodically executing 
> {{DockerImagesCommand}} on a particular image. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9273) Flexing a component of YARN service does not work as documented when using relative number

2019-02-28 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780709#comment-16780709
 ] 

Chandni Singh commented on YARN-9273:
-

{quote}
I chose option 1 because changing the Component Spec means changing the public 
API, and it seems to me that it's a bit overkill.
{quote}
[~masatana] Option 1 doesn't fix the bug when the user is directly posting to 
the Api Server. It only works when the user does it via the Yarn CLI. Also, 
adding a field (whether the num containers is relative or not) to the Component 
spec is backward compatible. We can't change the type of {{num containers}} to 
{{String}} because then it would be backward incompatible.
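
For illustration only (this is not the ServiceApiUtil code), resolving a 
relative value against the current count while keeping the container count a 
{{Long}} in the spec could look like:
{code:java}
public final class FlexCounts {
  private FlexCounts() { }

  static long resolve(long current, String requested) {
    char c = requested.charAt(0);
    if (c == '+' || c == '-') {
      // relative change: apply the signed delta to the current count
      return Math.max(current + Long.parseLong(requested), 0);
    }
    return Long.parseLong(requested);  // absolute count
  }

  public static void main(String[] args) {
    System.out.println(resolve(3, "+2"));  // 5
    System.out.println(resolve(3, "-2"));  // 1
    System.out.println(resolve(3, "2"));   // 2
  }
}
{code}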

> Flexing a component of YARN service does not work as documented when using 
> relative number
> --
>
> Key: YARN-9273
> URL: https://issues.apache.org/jira/browse/YARN-9273
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Masahiro Tanaka
>Assignee: Masahiro Tanaka
>Priority: Minor
> Attachments: YARN-9273.001.patch, YARN-9273.002.patch, 
> YARN-9273.003.patch, YARN-9273.004.patch
>
>
> [The 
> documents|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html]
>  says,
>  "Relative changes are also supported for the ${NUMBER_OF_CONTAINERS} in the 
> flex command, such as +2 or -2." when you want to flex a component of a YARN 
> service.
> I expected that {{yarn app -flex sleeper-service -component sleeper +1}} 
> increments the number of container, but actually it sets the number of 
> container to just one.
> I guess ApiServiceClient#actionFlex treats flexing when executing the {{yarn 
> app -flex}}, and it just uses {{Long.parseLong}} to convert the argument like 
> {{+1}}, which doesn't care relative numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9273) Flexing a component of YARN service does not work as documented when using relative number

2019-02-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779744#comment-16779744
 ] 

Chandni Singh commented on YARN-9273:
-

[~masatana] Thanks for reporting this bug and working on it. I have a comment 
on patch 4.

A user can flex a service just by posting directly to the Api Server, for 
example with curl. This doesn't use {{ApiServiceClient}}. If a user directly 
posts to the Api Server to flex the component with +1, they will still face the 
issue. 

So, I think the below changes in the {{ApiServiceClient}} are not needed. IMO, 
the problem is that the type of num containers is {{Long}}, which is why +1 is 
evaluated to 1. I think we need a way in the Component spec to specify that 
this is a relative change.

Also, logic similar to what is being added in this patch already exists in 
{{ServiceClient->parseNumberOfContainers}}.  

{code}
+      // We have to check the original number of container of the app
+      // so that we can do relative changes.
+      ClientResponse response = getApiClient(getServicePath(appName))
+          .get(ClientResponse.class);
+      if (response.getStatus() == 404 || response.getStatus() != 200) {
+        throw new YarnException(
+            MessageFormat.format(
+                "Fail to check the current application status: {0}",
+                appName));
+      }
+      Service currentService = jsonSerDeser.fromJson(
+          response.getEntity(String.class));
       Service service = new Service();
       service.setName(appName);
       service.setState(ServiceState.FLEX);
       for (Map.Entry<String, String> entry : componentCounts.entrySet()) {
         Component component = new Component();
         component.setName(entry.getKey());
-        Long numberOfContainers = Long.parseLong(entry.getValue());
+        Long currentNumberOfContainer = currentService.getComponent(
+            entry.getKey()).getNumberOfContainers();
+        Long numberOfContainers = ServiceApiUtil.parseNumberOfContainers(
+            currentNumberOfContainer, entry.getValue(), entry.getKey());
{code}
 


> Flexing a component of YARN service does not work as documented when using 
> relative number
> --
>
> Key: YARN-9273
> URL: https://issues.apache.org/jira/browse/YARN-9273
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Masahiro Tanaka
>Assignee: Masahiro Tanaka
>Priority: Minor
> Attachments: YARN-9273.001.patch, YARN-9273.002.patch, 
> YARN-9273.003.patch, YARN-9273.004.patch
>
>
> [The 
> documents|https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html]
>  says,
>  "Relative changes are also supported for the ${NUMBER_OF_CONTAINERS} in the 
> flex command, such as +2 or -2." when you want to flex a component of a YARN 
> service.
> I expected that {{yarn app -flex sleeper-service -component sleeper +1}} 
> increments the number of container, but actually it sets the number of 
> container to just one.
> I guess ApiServiceClient#actionFlex treats flexing when executing the {{yarn 
> app -flex}}, and it just uses {{Long.parseLong}} to convert the argument like 
> {{+1}}, which doesn't care relative numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9245) Add support for Docker Images command

2019-02-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779726#comment-16779726
 ] 

Chandni Singh commented on YARN-9245:
-

Thanks [~eyang]

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Fix For: 3.3.0
>
> Attachments: YARN-9245.001.patch, YARN-9245.002.patch, 
> YARN-9245.003.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9245) Add support for Docker Images command

2019-02-26 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778655#comment-16778655
 ] 

Chandni Singh commented on YARN-9245:
-

[~eyang] I have used {{json .}} as the format in patch 3. Please take a look 
and let me know if this looks good. 
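
For reference, each output line of docker images with the "{{json .}}" format 
is a self-describing JSON object per image. A minimal sketch of consuming it 
(assuming Jackson on the classpath; the sample line is made up):
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DockerImagesJsonFormat {
  public static void main(String[] args) throws Exception {
    // One line of `docker images --format "{{json .}}"` output (sample data).
    String line =
        "{\"Repository\":\"library/centos\",\"Tag\":\"7\",\"ID\":\"9f38484d220f\"}";
    JsonNode node = new ObjectMapper().readTree(line);
    System.out.println(node.path("Repository").asText() + ":"
        + node.path("Tag").asText());
  }
}
{code}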

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch, YARN-9245.002.patch, 
> YARN-9245.003.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9245) Add support for Docker Images command

2019-02-26 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9245:

Attachment: YARN-9245.003.patch

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch, YARN-9245.002.patch, 
> YARN-9245.003.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9245) Add support for Docker Images command

2019-02-05 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761361#comment-16761361
 ] 

Chandni Singh commented on YARN-9245:
-

In patch 2:
- Added support for listing all the images
- Added a format for the images output

YARN-9249 will use this command, so it will become clear there how this is 
going to be used.


> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch, YARN-9245.002.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9245) Add support for Docker Images command

2019-02-05 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9245:

Attachment: YARN-9245.002.patch

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch, YARN-9245.002.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9245) Add support for Docker Images command

2019-01-29 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755346#comment-16755346
 ] 

Chandni Singh edited comment on YARN-9245 at 1/29/19 8:10 PM:
--

[~eyang] [~ebadger] [~Jim_Brennan] could you please review?


was (Author: csingh):
[~eyang] [~ebadger] could you please review?

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9245) Add support for Docker Images command

2019-01-29 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755346#comment-16755346
 ] 

Chandni Singh commented on YARN-9245:
-

[~eyang] [~ebadger] could you please review?

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9245) Add support for Docker Images command

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9245:

Attachment: YARN-9245.001.patch

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9245.001.patch
>
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9249) Add support for docker image localization

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9249:

Description: Refer YARN-3854.

> Add support for docker image localization
> -
>
> Key: YARN-9249
> URL: https://issues.apache.org/jira/browse/YARN-9249
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> Refer YARN-3854.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9249) Add support for docker image localization

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9249:

Labels: Docker  (was: )

> Add support for docker image localization
> -
>
> Key: YARN-9249
> URL: https://issues.apache.org/jira/browse/YARN-9249
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Refer YARN-3854.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9249) Add support for docker image localization

2019-01-29 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9249:
---

 Summary: Add support for docker image localization
 Key: YARN-9249
 URL: https://issues.apache.org/jira/browse/YARN-9249
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Chandni Singh
Assignee: Chandni Singh






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9245) Add support for Docker Images command

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9245:

Issue Type: Sub-task  (was: Task)
Parent: YARN-3854

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5670) Add support for Docker image clean up

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-5670:

Parent Issue: YARN-3854  (was: YARN-8472)

> Add support for Docker image clean up
> -
>
> Key: YARN-5670
> URL: https://issues.apache.org/jira/browse/YARN-5670
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Regarding to Docker image localization, we also need a way to clean up the 
> old/stale Docker image to save storage space. We may extend deletion service 
> to utilize "docker rm" to do this.
> This is related to YARN-3854 and may depend on its implementation. Please 
> refer to YARN-3854 for Docker image localization details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8867) Retrieve the status of resource localization

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8867:

Parent Issue: YARN-3854  (was: YARN-8472)

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.007.patch, YARN-8867.008.patch, 
> YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-3854) Add localization support for docker images

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-3854:

Issue Type: Improvement  (was: Sub-task)
Parent: (was: YARN-8472)

> Add localization support for docker images
> --
>
> Key: YARN-3854
> URL: https://issues.apache.org/jira/browse/YARN-3854
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Sidharta Seethana
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
> Attachments: Localization Support For Docker Images.pdf, Localization 
> Support For Docker Images_002.pdf, YARN-3854-branch-2.8.001.patch, 
> YARN-3854_Localization_support_for_Docker_image_v1.pdf, 
> YARN-3854_Localization_support_for_Docker_image_v2.pdf, 
> YARN-3854_Localization_support_for_Docker_image_v3.pdf
>
>
> We need the ability to localize docker images when those images aren't 
> already available locally. There are various approaches that could be used 
> here with different trade-offs/issues : image archives on HDFS + docker load 
> ,  docker pull during the localization phase or (automatic) docker pull 
> during the run/launch phase. 
> We also need the ability to clean-up old/stale, unused images. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9245) Add support for Docker Images command

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9245:

Summary: Add support for Docker Images command  (was: Add support to find 
out when a docker image pull is complete)

> Add support for Docker Images command
> -
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9245) Add support to find out when a docker image pull is complete

2019-01-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9245:

Description: 
Refer https://issues.apache.org/jira/browse/YARN-3854

Need a way to find out whether a docker pull is completed or the docker image 
is present locally. Just executing the below docker command can provide the 
information
{code}
docker images  [REPOSITORY[:TAG]]
{code} 

Will add support for docker images command with this jira.


  was:
Refer https://issues.apache.org/jira/browse/YARN-3854

Need a way to find out whether a docker pull is completed or the docker image 
is present locally. Just executing the below docker command can provide the 
information
{code}
docker images  [REPOSITORY[:TAG]]
{code} 



> Add support to find out when a docker image pull is complete
> 
>
> Key: YARN-9245
> URL: https://issues.apache.org/jira/browse/YARN-9245
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: Docker
>
> Refer https://issues.apache.org/jira/browse/YARN-3854
> Need a way to find out whether a docker pull is completed or the docker image 
> is present locally. Just executing the below docker command can provide the 
> information
> {code}
> docker images  [REPOSITORY[:TAG]]
> {code} 
> Will add support for docker images command with this jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9245) Add support to find out when a docker image pull is complete

2019-01-28 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9245:
---

 Summary: Add support to find out when a docker image pull is 
complete
 Key: YARN-9245
 URL: https://issues.apache.org/jira/browse/YARN-9245
 Project: Hadoop YARN
  Issue Type: Task
  Components: yarn
Reporter: Chandni Singh
Assignee: Chandni Singh


Refer https://issues.apache.org/jira/browse/YARN-3854

Need a way to find out whether a docker pull is completed or the docker image 
is present locally. Just executing the below docker command can provide the 
information
{code}
docker images  [REPOSITORY[:TAG]]
{code} 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8867) Retrieve the status of resource localization

2019-01-24 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751681#comment-16751681
 ] 

Chandni Singh commented on YARN-8867:
-

Ran these unit tests a couple of times locally and they do pass.

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.007.patch, YARN-8867.008.patch, 
> YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8867) Retrieve the status of resource localization

2019-01-24 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751419#comment-16751419
 ] 

Chandni Singh commented on YARN-8867:
-

Addressed [~eyang]'s comments, the test failure in TestServiceAM, and the 
checkstyle warnings in patch 8.

The test failure in hadoop-yarn-server-resourcemanager seems unrelated to this 
patch; I don't see any tests in the RM failing. 
{code}
[WARNING] Tests run: 2449, Failures: 0, Errors: 0, Skipped: 7
[INFO] 
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 01:26 h
[INFO] Finished at: 2019-01-24T04:40:54+00:00
[INFO] Final Memory: 24M/827M
[INFO] 
[WARNING] The requested profile "parallel-tests" could not be activated because 
it does not exist.
[WARNING] The requested profile "native" could not be activated because it does 
not exist.
[WARNING] The requested profile "yarn-ui" could not be activated because it 
does not exist.
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) on 
project hadoop-yarn-server-resourcemanager: There was a timeout or other error 
in the fork -> [Help 1]
{code}


> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.007.patch, YARN-8867.008.patch, 
> YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8867) Retrieve the status of resource localization

2019-01-23 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8867:

Attachment: YARN-8867.008.patch

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.007.patch, YARN-8867.008.patch, 
> YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8867) Retrieve the status of resource localization

2019-01-23 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750413#comment-16750413
 ] 

Chandni Singh commented on YARN-8867:
-

Tested patch with the following spec:
{code}
{
  "version": "1.0.2",
  "components" :
  [
{
  "name": "sleeper",
  "number_of_containers": 2,
  "launch_command": "sleep 8000",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "configuration" : {
 "files": [
  {
"type": "STATIC",
"src_file": "file1.txt",
"dest_file": "changed_name.txt"
  }
]
   }
}
  ]
}
{code}

The status of the service with {{PENDING}} localization status
{code}
{
"components": [
{
"configuration": {
"env": {},
"files": [
{
"dest_file": "changed_name.txt",
"properties": {},
"src_file": "file1.txt",
"type": "STATIC"
}
],
"properties": {}
},
"containers": [
{
"bare_host": "10.22.8.153",
"component_instance_name": "sleeper-0",
"id": "container_1548273183277_0001_01_02",
"launch_time": 1548275812327,
"localization_statuses": [
{
"dest_file": "changed_name.txt",
"state": "PENDING"
}
],
"state": "RUNNING_BUT_UNREADY"
}
],
"decommissioned_instances": [],
"dependencies": [],
"launch_command": "sleep 8000",
"name": "sleeper",
"number_of_containers": 2,
"quicklinks": [],
"resource": {
"additional": {},
"cpus": 1,
"memory": "256"
},
"restart_policy": "ALWAYS",
"run_privileged_container": false,
"state": "FLEXING"
}
],
"configuration": {
"env": {},
"files": [],
"properties": {}
},
"dependencies": [],
"id": "application_1548273183277_0001",
"kerberos_principal": {},
"lifetime": -1,
"name": "test1",
"quicklinks": {},
"state": "STARTED",
"version": "1.0.2"
}
{code}
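
As an aside, a client polling this output could use something like the sketch 
below (assuming Jackson on the classpath; the helper name is made up) to wait 
until every reported localization status is COMPLETED:
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LocalizationStatusCheck {

  /** True only when every reported localization status is COMPLETED. */
  static boolean allLocalized(String serviceStatusJson) throws Exception {
    JsonNode root = new ObjectMapper().readTree(serviceStatusJson);
    for (JsonNode component : root.path("components")) {
      for (JsonNode container : component.path("containers")) {
        for (JsonNode status : container.path("localization_statuses")) {
          if (!"COMPLETED".equals(status.path("state").asText())) {
            return false;  // at least one resource is still PENDING/FAILED
          }
        }
      }
    }
    return true;
  }
}
{code}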

Once localization is completed, the status output of the service reflects it:
{code}
{
"components": [
{
"configuration": {
"env": {},
"files": [
{
"dest_file": "changed_name.txt",
"properties": {},
"src_file": "file1.txt",
"type": "STATIC"
}
],
"properties": {}
},
"containers": [
{
"bare_host": "10.22.8.153",
"component_instance_name": "sleeper-0",
"hostname": "HW12119.local",
"id": "container_1548273183277_0001_01_02",
"ip": "10.22.8.153",
"launch_time": 1548275812327,
"localization_statuses": [
{
"dest_file": "changed_name.txt",
"state": "COMPLETED"
}
],
"state": "RUNNING_BUT_UNREADY"
},
{
"bare_host": "10.22.8.153",
"component_instance_name": "sleeper-1",
"hostname": "HW12119.local",
"id": "container_1548273183277_0001_01_03",
"ip": "10.22.8.153",
"launch_time": 1548275813334,
"localization_statuses": [
{
"dest_file": "changed_name.txt",
"state": "COMPLETED"
}
],
"state": "RUNNING_BUT_UNREADY"
}
],
"decommissioned_instances": [],
"dependencies": [],
"launch_command": "sleep 8000",
"name": "sleeper",
"number_of_containers": 2,
"quicklinks": [],
"resource": {
"additional": {},
"cpus": 1,
"memory": "256"
},
"restart_policy": "ALWAYS",
"run_privileged_container": false,
"state": "FLEXING"
}
],
"configuration": {
"env": {},
"files": [],
"properties": {}
},

[jira] [Comment Edited] (YARN-8867) Retrieve the status of resource localization

2019-01-23 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750317#comment-16750317
 ] 

Chandni Singh edited comment on YARN-8867 at 1/23/19 6:46 PM:
--

Uploaded patch 7 which addresses the last review comments. I still have to test 
the change.


was (Author: csingh):
Uploaded patch 7 which address the last review comments. I still have to test 
change.

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.007.patch, YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8867) Retrieve the status of resource localization

2019-01-23 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8867:

Attachment: YARN-8867.007.patch

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.007.patch, YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9074) Docker container rm command should be executed after stop

2019-01-22 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749296#comment-16749296
 ] 

Chandni Singh commented on YARN-9074:
-

[~uranus] Thanks for the patch. The change looks good. The failures of the 
tests in {{TestContainer}} are related though; they need to be updated as well. 

> Docker container rm command should be executed after stop
> -
>
> Key: YARN-9074
> URL: https://issues.apache.org/jira/browse/YARN-9074
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhaohui Xin
>Assignee: Zhaohui Xin
>Priority: Major
> Attachments: YARN-9074.001.patch, image-2018-12-01-11-36-12-448.png, 
> image-2018-12-01-11-38-18-191.png
>
>
> {code:java}
> @Override
> public void transition(ContainerImpl container, ContainerEvent event) {
> container.setIsReInitializing(false);
> // Set exit code to 0 on success 
> container.exitCode = 0;
> // TODO: Add containerWorkDir to the deletion service.
> if (DockerLinuxContainerRuntime.isDockerContainerRequested(
> container.daemonConf,
> container.getLaunchContext().getEnvironment())) {
> removeDockerContainer(container);
> }
> if (clCleanupRequired) {
> container.dispatcher.getEventHandler().handle(
> new ContainersLauncherEvent(container,
> ContainersLauncherEventType.CLEANUP_CONTAINER));
> }
> container.cleanup();
> }{code}
> Now, when container is finished, NM firstly execute "_docker rm xxx"_  to 
> remove it and this thread is placed in DeletionService. see more in YARN-5366 
> .
> Next, NM will execute "_docker stop_" and "docker kill" command. these tow 
> commands are wrapped up in ContainerCleanup thread and executed by 
> ContainersLauncher. see more in YARN-7644. 
> The above will cause the container's cleanup to be split into two threads. I 
> think we should refactor these code to make all docker container killing 
> process be place in ContainerCleanup thread and "_docker rm_" should be 
> executed last.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9153) Diagnostics in the container status doesn't get reset after re-init

2018-12-19 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9153:
---

 Summary: Diagnostics in the container status doesn't get reset 
after re-init 
 Key: YARN-9153
 URL: https://issues.apache.org/jira/browse/YARN-9153
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, yarn
Reporter: Chandni Singh
Assignee: Chandni Singh


When a container is reinitialized, its diagnostics are set to a long string 
("Reinitializing await..."). Even after the container starts running again, these 
diagnostics are not cleared. 
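
A minimal sketch of one possible fix, assuming the diagnostics buffer can simply be reset when the container transitions back to RUNNING; the hook point and field are illustrative, not the actual patch:
{code:java}
// Sketch only: drop the stale "Reinitializing await..." message once the
// container is running again. Mirrors the idea, not ContainerImpl itself.
final class ReInitDiagnostics {
  private final StringBuilder diagnostics = new StringBuilder();

  void onReInitializing() {
    diagnostics.append("Reinitializing and awaiting kill of current process...");
  }

  void onRunningAfterReInit() {
    diagnostics.setLength(0);   // clear diagnostics left over from re-init
  }

  String current() {
    return diagnostics.toString();
  }
}
{code}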



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-19 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9126:

Attachment: YARN-9126.003.patch

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Attachments: YARN-9126.001.patch, YARN-9126.002.patch, 
> YARN-9126.003.patch
>
>
> When upgrading a container, container reinitialization always fails with code 
> 33.  This error code means the file being localized already exists while copying 
> resource files.  The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from 
> happening.  The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-18 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9126:

Attachment: YARN-9126.002.patch

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Attachments: YARN-9126.001.patch, YARN-9126.002.patch
>
>
> When upgrading a container, container reinitialization always fails with code 
> 33.  This error code means the file being localized already exists while copying 
> resource files.  The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from 
> happening.  The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-18 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Attachment: YARN-9084.002.patch

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9084.001.patch, YARN-9084.002.patch
>
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-17 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723580#comment-16723580
 ] 

Chandni Singh commented on YARN-9126:
-

There were two changes that caused the issue:
- YARN-7644: the cleanup of the working directory is done asynchronously.
- YARN-8569: this introduced the sysfs directory in the container's working 
directory, which needs to be deleted during the cleanup of the working directory.

Attached is patch 001. [~eyang], could you please take a look?
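
A minimal sketch of the idea, assuming the reinit path can hold a Future for the asynchronous deletion task and wait on it; the actual wiring through the NM's DeletionService and ContainerCleanup differs:
{code:java}
// Sketch only: block reinitialization until the old working directory
// (including the sysfs subdirectory) has been deleted, so localization cannot
// fail with error 33 ("file already exists").
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ReInitCleanupBarrier {
  private final ExecutorService deletionService = Executors.newSingleThreadExecutor();

  Future<?> scheduleWorkDirCleanup(Path containerWorkDir) {
    return deletionService.submit(() -> {
      // delete sysfs/ and the rest of the old working directory here
      System.out.println("deleting " + containerWorkDir);
    });
  }

  void reinitialize(Path containerWorkDir) throws Exception {
    Future<?> cleanup = scheduleWorkDirCleanup(containerWorkDir);
    cleanup.get();   // wait until the old files are gone
    // ... now it is safe to localize resources and relaunch the container
  }
}
{code}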

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Attachments: YARN-9126.001.patch
>
>
> When upgrading a container, container reinitialization always fails with code 
> 33.  This error code means the file being localized already exists while copying 
> resource files.  The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from 
> happening.  The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-17 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9126:

Attachment: YARN-9126.001.patch

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
> Attachments: YARN-9126.001.patch
>
>
> When upgrading a container, container reinitialization always fails with code 
> 33.  This error code means the file being localized already exists while copying 
> resource files.  The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from 
> happening.  The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-17 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723182#comment-16723182
 ] 

Chandni Singh commented on YARN-9126:
-

[~eyang] I think this is because of YARN-7644. 
Before this change, the cleanup of the container working directory was done in 
a blocking way. This change made it non-blocking, which is causing the issue.

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
>
> When upgrading a container, container reinitialization always fails with code 
> 33.  This error code means the file being localized already exists while copying 
> resource files.  The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from 
> happening.  The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9126) Container reinit always fails in branch-3.2 and trunk

2018-12-17 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh reassigned YARN-9126:
---

Assignee: Chandni Singh

> Container reinit always fails in branch-3.2 and trunk
> -
>
> Key: YARN-9126
> URL: https://issues.apache.org/jira/browse/YARN-9126
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
>
> When upgrading a container, container reinitialization always fails with code 
> 33.  This error code means the file being localized already exists while copying 
> resource files.  The container will retry with another container ID, hence 
> the problem is masked.
> The Hadoop 3.1.x relaunch logic seems to have some way to prevent this bug from 
> happening.  The same logic might be useful in branch-3.2 and trunk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9040) LevelDBCacheTimelineStore in ATS 1.5 leaks native memory

2018-12-13 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720512#comment-16720512
 ] 

Chandni Singh commented on YARN-9040:
-

[~tarunparimi] The change looks good to me.
There aren't any existing tests for {{KeyValueBasedTimelineStore}}, so any 
changes made to it cannot be verified by unit tests. We should create tests for 
{{KeyValueBasedTimelineStore}}, but that doesn't have to be part of this change.

[~rohithsharma] [~eyang] Could you please help review?

> LevelDBCacheTimelineStore in ATS 1.5 leaks native memory
> 
>
> Key: YARN-9040
> URL: https://issues.apache.org/jira/browse/YARN-9040
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.8.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9040.001.patch, YARN-9040.002.patch
>
>
> When LevelDBCacheTimelineStore from YARN-4219 is used as the ATS 1.5 entity 
> caching storage, we observe a memory leak due to leveldb files even after the 
> fix of YARN-5368.
> Top output shows 0.024 TB (25 GB) RES, even though the heap size is only 8 GB.
>  
>  
> {code:java}
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 25519 yarn 20 0 33.024g 0.024t 41468 S 6.2 26.0 21:07.39 
> /usr/java/default/bin/java -Dproc_timelineserver -Xmx8192m
> {code}
>  
> lsof shows a lot of open timeline-cache.ldb files which are referenced by the 
> ATS even though they are deleted (DEL), since they are not present when listing 
> them.
>  
> {code:java}
> java 25519 yarn DEL REG 253,28 9438452 
> /var/yarn/timeline/timelineEntityGroupId_1542280269959_55569_dag_1542280269959_55569_2-timeline-cache.ldb/07.sst
> java 25519 yarn DEL REG 253,28 9438438 
> /var/yarn/timeline/timelineEntityGroupId_1542280269959_55569_dag_1542280269959_55569_2-timeline-cache.ldb/07.sst
> java 25519 yarn DEL REG 253,28 9438437 
> /var/yarn/timeline/timelineEntityGroupId_1542280269959_55569_dag_1542280269959_55569_2-timeline-cache.ldb/05.sst
> {code}
>  
> It looks like LevelDBCacheTimelineStore is not closing these files because the 
> LevelDB DBIterator is never closed.
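
A minimal sketch of the kind of fix implied here, built on the iq80 leveldb API that the timeline stores use; the scan itself is illustrative, not the actual LevelDBCacheTimelineStore code:
{code:java}
// Sketch only: close the DBIterator so leveldb can release the .sst file handles.
// DB and DBIterator come from org.iq80.leveldb; the rest is illustrative.
import java.io.IOException;
import java.util.Map;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;

final class IteratorScan {

  /** Counts entries while guaranteeing the iterator (and its file handles) is closed. */
  static long countEntries(DB db) throws IOException {
    long count = 0;
    // DBIterator implements Closeable, so try-with-resources prevents the leak
    try (DBIterator iterator = db.iterator()) {
      for (iterator.seekToFirst(); iterator.hasNext(); iterator.next()) {
        Map.Entry<byte[], byte[]> entry = iterator.peekNext();
        if (entry != null) {
          count++;
        }
      }
    }
    return count;
  }
}
{code}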



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-10 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Attachment: YARN-9084.001.patch

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9084.001.patch
>
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-10 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Attachment: (was: YARN-9084.001.patch)

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-10 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Attachment: YARN-9084.001.patch

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-10 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Attachment: (was: YARN-9084.001.patch)

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-10 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Attachment: YARN-9084.001.patch

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-9084.001.patch
>
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-10 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Description: 
With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and the 
YARN registry before upgrade. However, it is observed that after the container is 
launched again as part of reinit, the ContainerStatus received from the NM has an 
IP and host even though the container fails as soon as it is launched. 

On the YARN Service side this results in the component instance transitioning to the 
READY state when it checks just for the presence of an IP address. 



> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> With YARN-9071 we do clear the IP address and hostname from the AM, the NM, and 
> the YARN registry before upgrade. However, it is observed that after the container 
> is launched again as part of reinit, the ContainerStatus received from the NM has 
> an IP and host even though the container fails as soon as it is launched. 
> On the YARN Service side this results in the component instance transitioning to 
> the READY state when it checks just for the presence of an IP address. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-07 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Description: This seems to be happening 

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> This seems to be happening 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-07 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9084:

Description: (was: This seems to be happening )

> Service Upgrade: With default readiness check, the status of upgrade is 
> reported to be successful prematurely
> -
>
> Key: YARN-9084
> URL: https://issues.apache.org/jira/browse/YARN-9084
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-05 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710746#comment-16710746
 ] 

Chandni Singh commented on YARN-9071:
-

Hi [~eyang],

I would like this to be backported to branch-3.1 and branch-3.2. 

Thanks,
Chandni

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, YARN-9071.005.patch, 
> YARN-9071.006.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9084) Service Upgrade: With default readiness check, the status of upgrade is reported to be successful prematurely

2018-12-05 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9084:
---

 Summary: Service Upgrade: With default readiness check, the status 
of upgrade is reported to be successful prematurely
 Key: YARN-9084
 URL: https://issues.apache.org/jira/browse/YARN-9084
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chandni Singh
Assignee: Chandni Singh






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-05 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9071:

Attachment: YARN-9071.006.patch

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, YARN-9071.005.patch, 
> YARN-9071.006.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-05 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710460#comment-16710460
 ] 

Chandni Singh commented on YARN-9071:
-

Addressed checkstyle warnings in patch 7.

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, YARN-9071.005.patch, 
> YARN-9071.006.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-04 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709475#comment-16709475
 ] 

Chandni Singh commented on YARN-9071:
-

[~eyang] I have uploaded patch 5, where the IP and host are cleared on both the AM 
side and the NM side before upgrade. Please take a look at it. 

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, YARN-9071.005.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-04 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9071:

Attachment: YARN-9071.005.patch

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, YARN-9071.005.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-04 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709176#comment-16709176
 ] 

Chandni Singh commented on YARN-9071:
-

As discussed offline:

[~billie.rinaldi] I created YARN-9082 as a follow-up Jira to remove the delay 
in un-registering a metric.

[~eyang] I will put a fix on the YARN Service AM side to remove the IP address 
from the registry before reinitialization. Currently the default readiness 
check is for the presence of an IP, and it succeeds because the IP address 
is still present from the previous launch. If we remove the IP address before the 
reinit, the instance will go into the READY state only once the container has been 
successfully launched again.
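
A minimal sketch of the AM-side idea, assuming the attribute keys and registry path handling shown below; the actual Yarn Service ComponentInstance code differs:
{code:java}
// Sketch only: clear the stale IP/hostname from the registry record before the
// reinit is requested, so the default readiness check cannot succeed on data
// from the previous launch. The attribute keys are assumptions.
import org.apache.hadoop.registry.client.api.BindFlags;
import org.apache.hadoop.registry.client.api.RegistryOperations;
import org.apache.hadoop.registry.client.types.ServiceRecord;

final class PreReinitRegistryCleaner {

  private static final String IP_ATTR = "yarn:ip";             // assumed key
  private static final String HOSTNAME_ATTR = "yarn:hostname"; // assumed key

  static void clearIpAndHost(RegistryOperations registry, String recordPath)
      throws Exception {
    ServiceRecord record = registry.resolve(recordPath);
    record.set(IP_ATTR, "");
    record.set(HOSTNAME_ATTR, "");
    // overwrite the existing record so the readiness check sees no IP
    registry.bind(recordPath, record, BindFlags.OVERWRITE);
  }
}
{code}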

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9082) Delay during unregistering metrics is unnecessary

2018-12-04 Thread Chandni Singh (JIRA)
Chandni Singh created YARN-9082:
---

 Summary: Delay during unregistering metrics is unnecessary
 Key: YARN-9082
 URL: https://issues.apache.org/jira/browse/YARN-9082
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chandni Singh
Assignee: Chandni Singh


Discovered while debugging YARN-9071

Quoting [~billie.rinaldi]

{quote}

I looked at YARN-3619, where the unregistration delay was added. It seems like 
this was added because unregistration was performed in getMetrics, which was 
causing a ConcurrentModificationException. However, unregistration was moved 
from getMetrics into the finished method (in the same patch), and this leads me 
to believe that the delay is never needed. I'm inclined to think we should 
remove the delay entirely, but would like to hear other opinions.

{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-03 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707854#comment-16707854
 ] 

Chandni Singh commented on YARN-9071:
-

[~billie.rinaldi] [~eyang], could you please review patch 4?

With patch 4, I can see the IP and host being fetched after reinit.
{code}
2018-12-03 13:30:07,024 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(2114)) - Container 
container_e71_1543596052785_0023_01_02 transitioned from 
REINITIALIZING_AWAITING_KILL to SCHEDULED

2018-12-03 13:30:07,025 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:onStopMonitoringContainer(934)) - Stopping 
resource-monitoring for container_e71_1543596052785_0023_01_02

2018-12-03 13:30:07,045 INFO  container.ContainerImpl 
(ContainerImpl.java:handle(2114)) - Container 
container_e71_1543596052785_0023_01_02 transitioned from SCHEDULED to 
RUNNING

2018-12-03 13:30:07,046 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:onStartMonitoringContainer(943)) - Starting 
resource-monitoring for container_e71_1543596052785_0023_01_02

2018-12-03 13:30:12,320 INFO  runtime.DockerLinuxContainerRuntime 
(DockerLinuxContainerRuntime.java:getIpAndHost(1178)) - Docker inspect output 
for container_e71_1543596052785_0023_01_02: 

2018-12-03 13:30:12,320 INFO  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:initializeProcessTrees(567)) - 
container_e71_1543596052785_0023_01_02's ip =, and hostname = 

 {code}

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-03 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9071:

Attachment: YARN-9071.004.patch

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch, YARN-9071.004.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-12-03 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707629#comment-16707629
 ] 

Chandni Singh commented on YARN-9071:
-

The lines below in {{initializeProcessTrees}} are throwing an error. There 
isn't any stack trace, which is strange.
{code}
  ContainerMetrics usageMetrics = ContainerMetrics
  .forContainer(containerId, containerMetricsPeriodMs,
  containerMetricsUnregisterDelayMs);
  usageMetrics.recordProcessId(pId);
{code}
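
A minimal debugging sketch, assuming the call can be temporarily wrapped so the swallowed throwable gets logged with its stack trace; this is for diagnosis only, not part of the patch:
{code:java}
// Sketch only: log the throwable (and its stack) that the monitor thread swallows.
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class MetricsDebug {
  private static final Logger LOG = LoggerFactory.getLogger(MetricsDebug.class);

  static void recordPidWithLogging(ContainerId containerId, String pId,
      long containerMetricsPeriodMs, long containerMetricsUnregisterDelayMs) {
    try {
      ContainerMetrics usageMetrics = ContainerMetrics
          .forContainer(containerId, containerMetricsPeriodMs,
              containerMetricsUnregisterDelayMs);
      usageMetrics.recordProcessId(pId);
    } catch (Throwable t) {
      // make the failure visible with a full stack trace, then rethrow
      LOG.error("Failed to record pid {} for {}", pId, containerId, t);
      throw new RuntimeException(t);
    }
  }
}
{code}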

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-11-30 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705537#comment-16705537
 ] 

Chandni Singh commented on YARN-9071:
-

There still seems to be an issue with updating the IP on the NM. Once container 
monitoring restarts, there appears to be an exception:
{code}
2018-11-30 16:39:44,983 WARN  monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:run(489)) - Uncaught exception in 
ContainersMonitorImpl while monitoring resource of  
container_e71_1543596052785_0019_01_02
{code}
This prevents fetching the IP.


> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-11-30 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9071:

Attachment: YARN-9071.003.patch

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch, 
> YARN-9071.003.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-11-30 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9071:

Attachment: YARN-9071.002.patch

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch, YARN-9071.002.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-11-30 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705033#comment-16705033
 ] 

Chandni Singh commented on YARN-9071:
-

Thanks [~billie.rinaldi]. You are right. I will move it there. 

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-11-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9071:

Attachment: YARN-9071.001.patch

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9071.001.patch
>
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

2018-11-29 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh reassigned YARN-9071:
---

Assignee: Chandni Singh

> NM and service AM don't have updated status for reinitialized containers
> 
>
> Key: YARN-9071
> URL: https://issues.apache.org/jira/browse/YARN-9071
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Chandni Singh
>Priority: Critical
>
> Container resource monitoring is not stopped during the reinitialization 
> process, and this prevents the NM from obtaining updated process tree 
> information when the container starts running again. I observed a 
> reinitialized container go from RUNNING to REINITIALIZING to 
> REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring 
> was then started for a second time, but since the trackingContainers entry 
> had already been initialized for the container, ContainersMonitor skipped 
> finding the new PID and IP for the container. A possible solution would be to 
> stop the container monitoring in the reinitialization process so that the 
> process tree information would be initialized properly when monitoring is 
> restarted. When the same container was stopped by the NM later, the NM did 
> not kill the container, and the service AM received an unexpected event (stop 
> at reinitializing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8867) Retrieve the status of resource localization

2018-11-28 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702526#comment-16702526
 ] 

Chandni Singh commented on YARN-8867:
-

[~eyang] With YARN Service, we currently request resources to be localized 
only when launching a container. Once the component instance is READY, its 
localization status is updated. Since the resources were localized as part of the 
container launch, they have already finished localizing by that point. To be able 
to observe a PENDING localization status, we need the additional capability in YARN 
Service to localize resources on demand. 
I plan to do this in an iterative fashion. The current changes in YARN Service 
were made to easily test this new protocol between the AM and the NM.
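
A minimal sketch of how an AM could poll the new API, using the class and method names proposed in the patch under review; treat them as illustrative until the patch is committed:
{code:java}
// Sketch only: ask the NM for the localization statuses of a container and
// report whether anything is still PENDING.
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.LocalizationState;
import org.apache.hadoop.yarn.api.records.LocalizationStatus;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.client.api.NMClient;

final class LocalizationPoller {

  /** Returns true once no resource of the container is still PENDING. */
  static boolean localizationFinished(NMClient nmClient, ContainerId containerId,
      NodeId nodeId) throws Exception {
    List<LocalizationStatus> statuses =
        nmClient.getLocalizationStatuses(containerId, nodeId);
    for (LocalizationStatus status : statuses) {
      if (status.getLocalizationState() == LocalizationState.PENDING) {
        return false;
      }
    }
    return true;
  }
}
{code}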



> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.wip.patch
>
>
> Refer to YARN-3854.
> Currently the NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete, 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status of 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8867) Retrieve the status of resource localization

2018-11-28 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702326#comment-16702326
 ] 

Chandni Singh commented on YARN-8867:
-

Uploaded patch 6 to address the javadoc error.
cc [~eyang]

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.wip.patch
>
>
> Refer to YARN-3854.
> Currently the NM does not have an API to retrieve the status of localization. 
> Unless the client can learn when the localization of a resource is complete, 
> irrespective of the resource type, it cannot take any appropriate action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the localization 
> status. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8867) Retrieve the status of resource localization

2018-11-28 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8867:

Attachment: YARN-8867.006.patch

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.003.patch, YARN-8867.004.patch, YARN-8867.005.patch, 
> YARN-8867.006.patch, YARN-8867.wip.patch
>
>
> Refer to YARN-3854.
> Currently the NM does not have an API to retrieve the status of localization. 
> Unless the client can learn when the localization of a resource is complete, 
> irrespective of the resource type, it cannot take any appropriate action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the localization 
> status. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9056) Yarn Service Upgrade: Instance state changes from UPGRADING to READY without performing a readiness check

2018-11-27 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9056:

Attachment: YARN-9056-branch-3.1.001.patch

> Yarn Service Upgrade: Instance state changes from UPGRADING to READY without 
> performing a readiness check
> -
>
> Key: YARN-9056
> URL: https://issues.apache.org/jira/browse/YARN-9056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9056-branch-3.1.001.patch, YARN-9056.001.patch, 
> YARN-9056.002.patch, YARN-9056.003.patch
>
>
> Currently, when an instance is upgraded, the state of the instance changes to 
> UPGRADING. Once the NM informs the AM that the upgrade is finished, the state 
> of the instance changes to STABLE.
> The instance state should change to STABLE only once the readiness check 
> succeeds.
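
For illustration, a minimal, self-contained sketch of the intended gating. The 
enum, interface, and method names are made up for this example and do not 
correspond to the actual component-instance state machine or readiness-check 
classes.

{code}
// Minimal sketch of the intended gating. The names are made up for
// illustration and do not correspond to the real component-instance code.
public class UpgradeGateSketch {

  enum InstanceState { UPGRADING, STABLE }

  interface ReadinessCheck {
    boolean isReady();   // e.g. the HTTP/port probe configured in the service spec
  }

  /** Called when the NM reports that the container upgrade has finished. */
  static InstanceState onUpgradeFinished(InstanceState current,
      ReadinessCheck check) {
    if (current != InstanceState.UPGRADING) {
      return current;                          // ignore stray events
    }
    // Do not promote the instance on the NM signal alone; wait for the
    // readiness check to succeed.
    return check.isReady() ? InstanceState.STABLE : InstanceState.UPGRADING;
  }

  public static void main(String[] args) {
    ReadinessCheck notYet = () -> false;
    ReadinessCheck ready = () -> true;
    System.out.println(onUpgradeFinished(InstanceState.UPGRADING, notYet)); // UPGRADING
    System.out.println(onUpgradeFinished(InstanceState.UPGRADING, ready));  // STABLE
  }
}
{code}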



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9056) Yarn Service Upgrade: Instance state changes from UPGRADING to READY without performing a readiness check

2018-11-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701131#comment-16701131
 ] 

Chandni Singh commented on YARN-9056:
-

Thanks [~eyang]. I will provide a patch for branch-3.1 as well

> Yarn Service Upgrade: Instance state changes from UPGRADING to READY without 
> performing a readiness check
> -
>
> Key: YARN-9056
> URL: https://issues.apache.org/jira/browse/YARN-9056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9056.001.patch, YARN-9056.002.patch, 
> YARN-9056.003.patch
>
>
> Currently, when an instance is upgraded, the state of the instance changes to 
> UPGRADING. Once the NM informs the AM that the upgrade is finished, the state 
> of the instance changes to STABLE.
> The instance state should change to STABLE only once the readiness check 
> succeeds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9056) Yarn Service Upgrade: Instance state changes from UPGRADING to READY without performing a readiness check

2018-11-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701057#comment-16701057
 ] 

Chandni Singh commented on YARN-9056:
-

Patch 3 contains the latest changes

> Yarn Service Upgrade: Instance state changes from UPGRADING to READY without 
> performing a readiness check
> -
>
> Key: YARN-9056
> URL: https://issues.apache.org/jira/browse/YARN-9056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9056.001.patch, YARN-9056.002.patch, 
> YARN-9056.003.patch
>
>
> Currently, when an instance is upgraded, the state of the instance changes to 
> UPGRADING. Once the NM informs the AM that the upgrade is finished, the state 
> of the instance changes to STABLE.
> The instance state should change to STABLE only once the readiness check 
> succeeds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9056) Yarn Service Upgrade: Instance state changes from UPGRADING to READY without performing a readiness check

2018-11-27 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-9056:

Attachment: YARN-9056.003.patch

> Yarn Service Upgrade: Instance state changes from UPGRADING to READY without 
> performing a readiness check
> -
>
> Key: YARN-9056
> URL: https://issues.apache.org/jira/browse/YARN-9056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9056.001.patch, YARN-9056.002.patch, 
> YARN-9056.003.patch
>
>
> Currently, when an instance is upgraded, the state of the instance changes to 
> UPGRADING. Once the NM informs the AM that the upgrade is finished, the state 
> of the instance changes to STABLE.
> The instance state should change to STABLE only once the readiness check 
> succeeds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9056) Yarn Service Upgrade: Instance state changes from UPGRADING to READY without performing a readiness check

2018-11-27 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700934#comment-16700934
 ] 

Chandni Singh commented on YARN-9056:
-

[~eyang] could you please review patch 2?

> Yarn Service Upgrade: Instance state changes from UPGRADING to READY without 
> performing a readiness check
> -
>
> Key: YARN-9056
> URL: https://issues.apache.org/jira/browse/YARN-9056
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-9056.001.patch, YARN-9056.002.patch
>
>
> Currently, when an instance is upgraded, the state of the instance changes to 
> UPGRADING. Once the NM informs the AM that the upgrade is finished, the state 
> of the instance changes to STABLE.
> The instance state should change to STABLE only once the readiness check 
> succeeds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


