[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime

2019-05-22 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845999#comment-16845999
 ] 

Jim Brennan commented on YARN-9560:
---

[~ebadger] thanks for the update!  I am +1 (non-binding) on patch 005.

> Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
> ---
>
> Key: YARN-9560
> URL: https://issues.apache.org/jira/browse/YARN-9560
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9560.001.patch, YARN-9560.002.patch, 
> YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch
>
>
> Since the new OCI/squashFS/runc runtime will be using a lot of the same code 
> as DockerLinuxContainerRuntime, it would be good to move a bunch of the 
> DockerLinuxContainerRuntime code up a level to an abstract class that both of 
> the runtimes can extend. 
> The new structure will look like:
> {noformat}
> OCIContainerRuntime (abstract class)
>   - DockerLinuxContainerRuntime
>   - FSImageContainerRuntime (name negotiable)
> {noformat}
> This JIRA should only change the structure of the code, not the actual 
> semantics





[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime

2019-05-20 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844121#comment-16844121
 ] 

Jim Brennan commented on YARN-9560:
---

[~ebadger] Thanks for the patch. Overall this looks like a good restructuring 
to enable the addition of the oci runtime. Some comments:

OCIContainerRuntime.java
 * comment at line 70 - maybe change “Docker containers” to “OCI-compliant 
containers”, or something like that. There are other references to Docker that 
might need to be genericized as well. If you are planning to make these types 
of changes as part of actually adding the oci runtime, that is OK with me.
 * comment: the YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE comment was 
removed from DockerLinuxContainerRuntime.java, but was not added to 
OCIContainerRuntime.java.
 * Looks like the declaration of ENV_DOCKER_CONTAINER_RUN_OVERRIDE_DISABLE was 
removed from DockerLinuxContainerRuntime.java, and was not added to 
OCIContainerRuntime.java.
 * YARN_SYSFS_PATH also seems to have been dropped.
 * Seems like isDockerContainerRequested() will ultimately need to be pushed up 
to OCIContainerRuntime.
 * Removed protected scope from mountReadOnlyPath() - should it be private?

DockerLinuxContainerRuntime
 * Removed private scope from addCgroupParentIfRequired()?
 * handleContainerStop(), handleContainerKill(), and handleContainerRemove() 
should be protected?
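
To make the intended split concrete, here is a minimal sketch of the hierarchy (class names follow this JIRA; the constant value, member scopes, and method bodies are illustrative only, not the actual patch):
{code:java}
// Shared pieces move up into the abstract base so both runtimes see one definition.
abstract class OCIContainerRuntimeSketch {
  // e.g. the run-override-disable env var and YARN_SYSFS_PATH would live here
  // (name taken from the review comments above; value illustrative)
  static final String YARN_SYSFS_PATH = "/hadoop/yarn/sysfs";

  // runtime-specific lifecycle handling stays in the subclasses
  protected abstract void handleContainerStop(String containerId);
  protected abstract void handleContainerRemove(String containerId);
}

// Docker-specific behaviour remains in the Docker runtime; the planned runc
// runtime would extend the same base class.
class DockerLinuxContainerRuntimeSketch extends OCIContainerRuntimeSketch {
  @Override
  protected void handleContainerStop(String containerId) {
    System.out.println("docker stop " + containerId);
  }

  @Override
  protected void handleContainerRemove(String containerId) {
    System.out.println("docker rm " + containerId);
  }
}
{code}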

> Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
> ---
>
> Key: YARN-9560
> URL: https://issues.apache.org/jira/browse/YARN-9560
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9560.001.patch
>
>
> Since the new OCI/squashFS/runc runtime will be using a lot of the same code 
> as DockerLinuxContainerRuntime, it would be good to move a bunch of the 
> DockerLinuxContainerRuntime code up a level to an abstract class that both of 
> the runtimes can extend. 
> The new structure will look like:
> {noformat}
> OCIContainerRuntime (abstract class)
>   - DockerLinuxContainerRuntime
>   - FSImageContainerRuntime (name negotiable)
> {noformat}
> This JIRA should only change the structure of the code, not the actual 
> semantics





[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime

2019-05-21 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844919#comment-16844919
 ] 

Jim Brennan commented on YARN-9560:
---

Thanks for the updates [~ebadger]!

Some comments on the new patch:
 * Need to fix checkstyle, javadoc, and double-check junit failures
 * ContainerCleanup - remove import for DockerLinuxContainerRuntime
 * GpuResourceHandlerImpl - remove import for DockerLinuxContainerRuntime and 
(nit) move import for OCIContainerRuntime down.
 * DockerLinuxContainerRuntime (nit) - comment for signalContainer references 
OCI-compliant instead of docker.
 * DeviceResourceHandlerImpl - (nit) move import for OCIContainerRuntime down.

> Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
> ---
>
> Key: YARN-9560
> URL: https://issues.apache.org/jira/browse/YARN-9560
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9560.001.patch, YARN-9560.002.patch
>
>
> Since the new OCI/squashFS/runc runtime will be using a lot of the same code 
> as DockerLinuxContainerRuntime, it would be good to move a bunch of the 
> DockerLinuxContainerRuntime code up a level to an abstract class that both of 
> the runtimes can extend. 
> The new structure will look like:
> {noformat}
> OCIContainerRuntime (abstract class)
>   - DockerLinuxContainerRuntime
>   - FSImageContainerRuntime (name negotiable)
> {noformat}
> This JIRA should only change the structure of the code, not the actual 
> semantics





[jira] [Commented] (YARN-9518) can't use CGroups with YARN in centos7

2019-05-14 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839570#comment-16839570
 ] 

Jim Brennan commented on YARN-9518:
---

[~shurong.mai], ideally we would use the same solution in 2.7 that we used in 
2.8 (YARN-2194).   It doesn't look like that patch applies to 2.7, so more work 
would need to be done to port that approach back to 2.7.  Can you look into 
whether that is possible?

 

 

 

> can't use CGroups with YARN in centos7 
> ---
>
> Key: YARN-9518
> URL: https://issues.apache.org/jira/browse/YARN-9518
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.7
>Reporter: Shurong Mai
>Priority: Major
>  Labels: cgroup, patch
> Attachments: YARN-9518-branch-2.7.7.001.patch, 
> YARN-9518-trunk.001.patch, YARN-9518.patch
>
>
> The os version is centos7. 
> {code:java}
> cat /etc/redhat-release
> CentOS Linux release 7.3.1611 (Core)
> {code}
> When I set the configuration variables for cgroups with YARN, the nodemanager 
> started without any problem. But when I ran a job, the job failed with 
> the exceptional nodemanager logs shown at the end.
> In these logs, the important message is " Can't open file /sys/fs/cgroup/cpu as 
> node manager - Is a directory "
> After analysing, I found the reason. In centos6, the cgroup "cpu" and 
> "cpuacct" subsystems are as follows: 
> {code:java}
> /sys/fs/cgroup/cpu
> /sys/fs/cgroup/cpuacct
> {code}
> But in centos7, they are as follows:
> {code:java}
> /sys/fs/cgroup/cpu -> cpu,cpuacct
> /sys/fs/cgroup/cpuacct -> cpu,cpuacct
> /sys/fs/cgroup/cpu,cpuacct{code}
> "cpu" and "cpuacct" have merge as "cpu,cpuacct".  "cpu"  and  "cpuacct"  are 
> symbol links. 
> As I look at source code, nodemamager get the cgroup subsystem info by 
> reading /proc/mounts. So It get the cpu and cpuacct subsystem path are also 
> "/sys/fs/cgroup/cpu,cpuacct". 
> The resource description arguments of container-executor is such as follows: 
> {code:java}
> cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_01/tasks
> {code}
> There is a comma in the cgroup path, but the comma is the separator for multiple 
> resources. Therefore, the cgroup path is truncated by container-executor to 
> "/sys/fs/cgroup/cpu" rather than the correct cgroup path " 
> /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_01/tasks
>  ", and it reports the error in the log " Can't open file /sys/fs/cgroup/cpu as 
> node manager - Is a directory "
> Hence I modified the source code and submitted a patch. The idea of the patch is 
> that the nodemanager gets the cgroup cpu path as "/sys/fs/cgroup/cpu" rather than 
> "/sys/fs/cgroup/cpu,cpuacct". As a result, the resource description 
> argument of container-executor is as follows: 
> {code:java}
> cgroups=/sys/fs/cgroup/cpu/hadoop-yarn/container_1554210318404_0057_02_01/tasks
> {code}
> Note that there is no comma in the path, and it is a valid path because 
> "/sys/fs/cgroup/cpu" is a symbolic link to "/sys/fs/cgroup/cpu,cpuacct". 
> After applying the patch, the problem is resolved and the job can run 
> successfully.
> The patch is compatible with the cgroup paths of earlier OS versions such as 
> centos6 and centos7, and is universally applicable to cgroup subsystem paths such 
> as the cgroup network subsystem, as follows:  
> {code:java}
> /sys/fs/cgroup/net_cls -> net_cls,net_prio
> /sys/fs/cgroup/net_prio -> net_cls,net_prio
> /sys/fs/cgroup/net_cls,net_prio{code}
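
A minimal sketch of the symlink-based idea described above (illustrative only; this is not the actual CgroupsLCEResourcesHandler code, and the method name is hypothetical):
{code:java}
import java.io.File;

public class CgroupControllerPathSketch {
  /**
   * Given a controller name (e.g. "cpu") and the mount point found in /proc/mounts
   * (e.g. "/sys/fs/cgroup/cpu,cpuacct"), prefer the per-controller symlink so the
   * path handed to container-executor contains no comma.
   */
  static String controllerPathFor(String controller, String mountPoint) {
    File mount = new File(mountPoint);
    if (!mount.getName().contains(",")) {
      return mountPoint;              // centos6-style layout: nothing to do
    }
    File symlink = new File(mount.getParentFile(), controller);
    if (symlink.exists()) {
      return symlink.getPath();       // e.g. /sys/fs/cgroup/cpu -> cpu,cpuacct
    }
    return mountPoint;                // fall back to the combined path
  }

  public static void main(String[] args) {
    // prints /sys/fs/cgroup/cpu on a centos7-style layout, the input path otherwise
    System.out.println(controllerPathFor("cpu", "/sys/fs/cgroup/cpu,cpuacct"));
  }
}
{code}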
>  
>  
> ##
> {panel:title=exceptional nodemanager logs:}
> 2019-04-19 20:17:20,095 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1554210318404_0042_01_01 transitioned from LOCALIZED 
> to RUNNING
>  2019-04-19 20:17:20,101 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1554210318404_0042_01_01 is : 27
>  2019-04-19 20:17:20,103 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception 
> from container-launch with container ID: container_155421031840
>  4_0042_01_01 and exit code: 27
>  ExitCodeException exitCode=27:
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>  at org.apache.hadoop.util.Shell.run(Shell.java:482)
>  at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>  at 
> 

[jira] [Commented] (YARN-9561) Add C changes for the new OCI/squashfs/runc runtime

2019-05-20 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844311#comment-16844311
 ] 

Jim Brennan commented on YARN-9561:
---

[~ebadger] thanks for the patch!  I did an extensive review of this C code when 
we added it to our internal 2.8-based branch, and we have been running with it 
since January.   So for this review, I compared our internal code to this code 
to verify the minor changes you had to make to the new files, and then 
concentrated on the diffs from trunk in the modified files.  This code looks 
good to me.

I had one minor nit: in container-executor.c, we will likely need to add a call 
to create_yarn_sysfs() in setup_container_paths(), but we might not need it 
right away.

 

> Add C changes for the new OCI/squashfs/runc runtime
> ---
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new OCI/squashFS/runc runtime. There should 
> be no changes to existing code paths. 





[jira] [Commented] (YARN-9518) can't use CGroups with YARN in centos7

2019-04-29 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829328#comment-16829328
 ] 

Jim Brennan commented on YARN-9518:
---

[~shurong.mai], are you running with the latest code (trunk)?   The patch you 
put up looks like it is based on a version of CgroupsLCEResourcesHandler() from 
before 5/19/2017 (YARN-5301).

Can you verify the problem exists in trunk?

 

 

> can't use CGroups with YARN in centos7 
> ---
>
> Key: YARN-9518
> URL: https://issues.apache.org/jira/browse/YARN-9518
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 2.9.2, 2.8.5, 2.7.7, 3.1.2
>Reporter: Shurong Mai
>Priority: Major
>  Labels: cgroup, patch
> Attachments: YARN-9518.patch
>
>
> The os version is centos7.
>  
> When I set the configuration variables for cgroups with YARN, the nodemanager 
> started without any problem. But when I ran a job, the job failed with 
> the exceptional nodemanager logs shown at the end.
> In these logs, the important message is " Can't open file /sys/fs/cgroup/cpu as 
> node manager - Is a directory "
> After analysing, I found the reason. In centos6, the cgroup "cpu" and 
> "cpuacct" subsystems are as follows: 
> {code:java}
> /sys/fs/cgroup/cpu
> /sys/fs/cgroup/cpuacct
> {code}
> But in centos7, they are as follows:
> {code:java}
> /sys/fs/cgroup/cpu -> cpu,cpuacct
> /sys/fs/cgroup/cpuacct -> cpu,cpuacct
> /sys/fs/cgroup/cpu,cpuacct{code}
> "cpu" and "cpuacct" have merge as "cpu,cpuacct".  "cpu"  and  "cpuacct"  are 
> symbol links. 
> As I look at source code, nodemamager get the cgroup subsystem info by 
> reading /proc/mounts. So It get the cpu and cpuacct subsystem path are also 
> "/sys/fs/cgroup/cpu,cpuacct". 
> The resource description arguments of container-executor is such as follows: 
> {code:java}
> cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_01/tasks
> {code}
> There is a comma in the cgroup path, but the comma is the separator for multiple 
> resources. Therefore, the cgroup path is truncated to "/sys/fs/cgroup/cpu" 
> rather than the correct cgroup path " 
> /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_01/tasks
>  ", and it reports the error in the log " Can't open file /sys/fs/cgroup/cpu as 
> node manager - Is a directory "
> Hence I modified the source code and submitted a patch. The idea of the patch is 
> that the nodemanager gets the cgroup cpu path as "/sys/fs/cgroup/cpu" rather than 
> "/sys/fs/cgroup/cpu,cpuacct". As a result, the resource description 
> argument of container-executor is as follows: 
> {code:java}
> cgroups=/sys/fs/cgroup/cpu/hadoop-yarn/container_1554210318404_0057_02_01/tasks
> {code}
> Note that there is no comma in the path, and it is a valid path because 
> "/sys/fs/cgroup/cpu" is a symbolic link to "/sys/fs/cgroup/cpu,cpuacct". 
> After applying the patch, the problem is resolved and the job can run 
> successfully.
> The patch is universally applicable to cgroup subsystem paths, such as the cgroup 
> network subsystem, as follows:  
> {code:java}
> /sys/fs/cgroup/net_cls -> net_cls,net_prio
> /sys/fs/cgroup/net_prio -> net_cls,net_prio
> /sys/fs/cgroup/net_cls,net_prio{code}
>  
>  
> ##
> {panel:title=exceptional nodemanager logs:}
> 2019-04-19 20:17:20,095 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1554210318404_0042_01_01 transitioned from LOCALIZED 
> to RUNNING
>  2019-04-19 20:17:20,101 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1554210318404_0042_01_01 is : 27
>  2019-04-19 20:17:20,103 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception 
> from container-launch with container ID: container_155421031840
>  4_0042_01_01 and exit code: 27
>  ExitCodeException exitCode=27:
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>  at org.apache.hadoop.util.Shell.run(Shell.java:482)
>  at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> 

[jira] [Commented] (YARN-9518) can't use CGroups with YARN in centos7

2019-04-30 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830347#comment-16830347
 ] 

Jim Brennan commented on YARN-9518:
---

[~shurong.mai], your patch needs to be based on trunk.   I tried 
applying your patch to my local version of trunk, and it does not apply.   See 
[https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute]

Also see the patch naming convention - it needs to be something like: 
YARN-9518.001.patch to be picked up by the automated tests.

I was not suggesting that this issue was fixed by YARN-5301 - there have been a 
few other changes since then.  It just looks like your current patch is based 
on code from before YARN-5301.

 

> can't use CGroups with YARN in centos7 
> ---
>
> Key: YARN-9518
> URL: https://issues.apache.org/jira/browse/YARN-9518
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0, 2.9.2, 2.8.5, 2.7.7, 3.1.2
>Reporter: Shurong Mai
>Priority: Major
>  Labels: cgroup, patch
> Attachments: YARN-9518.patch
>
>
> The os version is centos7. 
> {code:java}
> cat /etc/redhat-release
> CentOS Linux release 7.3.1611 (Core)
> {code}
> When I set the configuration variables for cgroups with YARN, the nodemanager 
> started without any problem. But when I ran a job, the job failed with 
> the exceptional nodemanager logs shown at the end.
> In these logs, the important message is " Can't open file /sys/fs/cgroup/cpu as 
> node manager - Is a directory "
> After analysing, I found the reason. In centos6, the cgroup "cpu" and 
> "cpuacct" subsystems are as follows: 
> {code:java}
> /sys/fs/cgroup/cpu
> /sys/fs/cgroup/cpuacct
> {code}
> But in centos7, they are as follows:
> {code:java}
> /sys/fs/cgroup/cpu -> cpu,cpuacct
> /sys/fs/cgroup/cpuacct -> cpu,cpuacct
> /sys/fs/cgroup/cpu,cpuacct{code}
> "cpu" and "cpuacct" have merge as "cpu,cpuacct".  "cpu"  and  "cpuacct"  are 
> symbol links. 
> As I look at source code, nodemamager get the cgroup subsystem info by 
> reading /proc/mounts. So It get the cpu and cpuacct subsystem path are also 
> "/sys/fs/cgroup/cpu,cpuacct". 
> The resource description arguments of container-executor is such as follows: 
> {code:java}
> cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_01/tasks
> {code}
> There is a comma in the cgroup path, but the comma is the separator for multiple 
> resources. Therefore, the cgroup path is truncated by container-executor to 
> "/sys/fs/cgroup/cpu" rather than the correct cgroup path " 
> /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_01/tasks
>  ", and it reports the error in the log " Can't open file /sys/fs/cgroup/cpu as 
> node manager - Is a directory "
> Hence I modified the source code and submitted a patch. The idea of the patch is 
> that the nodemanager gets the cgroup cpu path as "/sys/fs/cgroup/cpu" rather than 
> "/sys/fs/cgroup/cpu,cpuacct". As a result, the resource description 
> argument of container-executor is as follows: 
> {code:java}
> cgroups=/sys/fs/cgroup/cpu/hadoop-yarn/container_1554210318404_0057_02_01/tasks
> {code}
> Note that there is no comma in the path, and it is a valid path because 
> "/sys/fs/cgroup/cpu" is a symbolic link to "/sys/fs/cgroup/cpu,cpuacct". 
> After applying the patch, the problem is resolved and the job can run 
> successfully.
> The patch is universally applicable to cgroup subsystem paths, such as the cgroup 
> network subsystem, as follows:  
> {code:java}
> /sys/fs/cgroup/net_cls -> net_cls,net_prio
> /sys/fs/cgroup/net_prio -> net_cls,net_prio
> /sys/fs/cgroup/net_cls,net_prio{code}
>  
>  
> ##
> {panel:title=exceptional nodemanager logs:}
> 2019-04-19 20:17:20,095 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1554210318404_0042_01_01 transitioned from LOCALIZED 
> to RUNNING
>  2019-04-19 20:17:20,101 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code 
> from container container_1554210318404_0042_01_01 is : 27
>  2019-04-19 20:17:20,103 WARN 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception 
> from container-launch with container ID: container_155421031840
>  4_0042_01_01 and exit code: 27
>  ExitCodeException exitCode=27:
>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>  at org.apache.hadoop.util.Shell.run(Shell.java:482)
>  at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
>  at 
> 

[jira] [Updated] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-07 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9527:
--
Attachment: YARN-9527.002.patch

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.





[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-19 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822028#comment-16822028
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang], why is launch.markLaunched() returning false in this case?

 

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When a docker container encounters an error and exits prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.





[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-02 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831903#comment-16831903
 ] 

Jim Brennan commented on YARN-9527:
---

I was able to find a node where the problem was actively happening, so I 
grabbed a heap dump of the nodemanager process and saved off the NM logs. From 
this, I was able to figure out what was happening. This sequence of events 
matches several other logs that we have examined.  Note that this analysis was 
done on our internal version of branch-2.8, but based on code inspection, I 
believe the problem still exists in trunk.

*Sequence of events, with relevant logs:*

Container transitions from NEW to LOCALIZING
{noformat}
2019-04-26 05:24:43,356 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e29_1550394211378_12160590_01_08 transitioned from NEW to 
LOCALIZING
{noformat}
 * ContainerImpl.RequestResourcesTransition
 Sends a ContainerLocalizationRequestEvent to ResourceLocalizationService 
(INIT_CONTAINER_RESOURCES)
 * ResourceLocalizationService.handleInitContainerResources()
 Sends ResourceRequestEvent for each LocalResourceRequest to 
LocalResourcesTrackerImpl (REQUEST)
 in this case, there are 11 resources

*Container transitions from LOCALIZING to KILLING (before we process any of 
these resources in LocalizerTracker)*
{noformat}
2019-04-26 05:24:43,356 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e29_1550394211378_12160590_01_08 transitioned from LOCALIZING to 
KILLING
{noformat}
 * ContainerImpl.KillDuringLocalizationTransition
 container.cleanup()
 collects list of privateRsrcs for this container and send 
ContainerLocalizationCleanup event
 * ResourceLocalizationService.handleCleanupContainerResources()
 ** For each resource, send a ResourceReleaseEvent to LocalResourcesTrackerImpl 
(RELEASE)
 ** LocalizerTracker.cleanupPrivLocalizers() (called directly)

 *** Gets the LocalizerRunner for this container from privLocalizers
 *Because we have not yet handled any LocalizerResourceRequestEvents for this 
container, we don’t find a LocalizerRunner, so we just return*
 ** Deletes the container directories.
 Sends CONTAINER_RESOURCES_CLEANEDUP event to ContainerImpl
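
One way to close this window is sketched below (illustrative only; not necessarily what the attached patch does): remember containers whose localizers were already cleaned up, and ignore any resource-request event that arrives for them afterwards.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative guard, not the actual LocalizerTracker code.
class LocalizerTrackerSketch {
  private final Set<String> cleanedUpContainers = ConcurrentHashMap.newKeySet();

  void cleanupPrivLocalizers(String containerId) {
    // No LocalizerRunner may exist yet; record the cleanup so a late request is dropped.
    cleanedUpContainers.add(containerId);
  }

  void handleLocalizerResourceRequest(String containerId, String resource) {
    if (cleanedUpContainers.contains(containerId)) {
      return; // container already cleaned up: do not start a LocalizerRunner for it
    }
    // ... otherwise create/locate the LocalizerRunner and queue the resource ...
  }
}
{code}
The sequence of events then continues: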

LocalResourcesTrackerImpl thread processes event queue
 * LocalResourcesTrackerImpl.handle
 Creates new LocalizedResources and adds them to localrsrc map (state is INIT)
 * LocalizedResource.FetchResourceTransition
 ** Adds this container to refs
 ** Sends LocalizerResourceRequestEvent to LocalizerTracker
 ** State changes to DOWNLOADING

{noformat}
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_common_ws-1.2.27.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_common_grid.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_reporting_cdw_common.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/yjava_http_client-0.3.23.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/jcontrib_degrading_stats_util-0.1.17.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_batch_service_client-1.2.16.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/json-smart-1.0.6.3.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/async-http-client-0.3.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_cdw_cow_loader.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/nct.jar 
transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event 

[jira] [Created] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-02 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-9527:
-

 Summary: Rogue LocalizerRunner/ContainerLocalizer repeatedly 
downloading same file
 Key: YARN-9527
 URL: https://issues.apache.org/jira/browse/YARN-9527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.1.2, 2.8.5
Reporter: Jim Brennan


A rogue ContainerLocalizer can get stuck in a loop continuously downloading the 
same file while generating an "Invalid event: LOCALIZED at LOCALIZED" exception 
on each iteration.  Sometimes this continues long enough that it fills up a 
disk or depletes available inodes for the filesystem.





[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-02 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831883#comment-16831883
 ] 

Jim Brennan commented on YARN-9527:
---

For example, we recently had a case where all of the disks used by yarn were 
full:
{noformat}
Filesystem  1K-blocks   Used Available Use% Mounted on
/dev/sdb4  5776759588 5714378904   4561576 100% /grid/1
/dev/sdd2  5840971776 5775661160   6849008 100% /grid/3
/dev/sdc2  5840971776 5777982304   4527864 100% /grid/2
/dev/sda4  5776759588 5712614448   6326032 100% /grid/0
{noformat}
Upon investigation, we found the NM log full of the “Invalid event: LOCALIZED 
at LOCALIZED” exceptions for a file called creative.data, and we found 2229 
copies of that file in the usercache for the user:
{noformat}
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/19/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/100014/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/100024/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100189/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100199/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100214/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100229/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100244/creative.data
…
{noformat}
We had a record of a similar problem reported back in September of 2017.
 I scanned our clusters to see how often this was happening. On some clusters, 
there were a significant number of nodes where this “LOCALIZED at LOCALIZED” 
exception had occurred. For example, on one cluster there were 122 nodes where 
I found that log message, some nodes with a large number:
{noformat}
  12566 node585n18:
  15053 node585n30:
  15819 node262n14:
  36182 node582n24:
  42623 node585n28:
  7 node586n24:
  47380 node588n03:
 234528 node582n01:
 494196 node221n32:
 688038 node221n01:
1210223 node1442n30:
1306207 node194n06:
1331739 node1442n21:
1366933 node588n37:
1718461 node583n22:
2050377 node588n33:
2252679 node287n05:
{noformat}

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Priority: Major
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.





[jira] [Assigned] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-03 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-9527:
-

Assignee: Jim Brennan

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.





[jira] [Updated] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-08 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9527:
--
Attachment: YARN-9527.003.patch

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.





[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-08 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835623#comment-16835623
 ] 

Jim Brennan commented on YARN-9527:
---

I put up patch 003 to address the checkstyle issues.

 

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.





[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime

2019-06-27 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874203#comment-16874203
 ] 

Jim Brennan commented on YARN-9560:
---

I think that since these are not static strings - they are determined at runtime by 
calling the getRuntimeType() method - the current camelCase naming is 
appropriate: it makes it clear to the developer that these strings are NOT 
static; they depend on the class of the runtime.  I think going to extra 
effort to make them static doesn't really make things clearer.
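
For illustration (the field and method shapes here are hypothetical, not the actual patch), the distinction is roughly:
{code:java}
abstract class ContainerRuntimeSketch {
  abstract String getRuntimeType();   // e.g. "DOCKER" or "RUNC"

  // camelCase and non-static on purpose: the value is only known once the
  // concrete runtime class is known
  String envRunOverrideDisable() {
    return "YARN_CONTAINER_RUNTIME_" + getRuntimeType() + "_RUN_OVERRIDE_DISABLE";
  }
}

class DockerRuntimeSketch extends ContainerRuntimeSketch {
  @Override
  String getRuntimeType() {
    return "DOCKER";
  }
}
{code}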


> Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
> ---
>
> Key: YARN-9560
> URL: https://issues.apache.org/jira/browse/YARN-9560
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9560.001.patch, YARN-9560.002.patch, 
> YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, 
> YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, 
> YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch
>
>
> Since the new RuncContainerRuntime will be using a lot of the same code as 
> DockerLinuxContainerRuntime, it would be good to move a bunch of the 
> DockerLinuxContainerRuntime code up a level to an abstract class that both of 
> the runtimes can extend. 
> The new structure will look like:
> {noformat}
> OCIContainerRuntime (abstract class)
>   - DockerLinuxContainerRuntime
>   - RuncContainerRuntime
> {noformat}
> This JIRA should only change the structure of the code, not the actual 
> semantics





[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-24 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825222#comment-16825222
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang] I'm not sure I agree.  This suggests that containerAlreadyLaunched has 
not been set yet when we get here.   It seems to me that the bug is in the 
relaunch case - shouldn't we be marking the container launched when we relaunch 
it?   It looks like ContainerLaunch.relaunchContainer() calls 
prepareForLaunch(), which should set it.  Do you know why this is not happening 
in this case?



> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When a docker container encounters an error and exits prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.





[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-24 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825333#comment-16825333
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang] I am not too familiar with the ContainerRelaunch path, but why is it 
using getLocalPathForRead() ? Doesn't it need to overwrite that file?
ContainerLaunch is using:
{noformat}
  String pidFileSubpath = getPidFileSubpath(appIdStr, containerIdStr);
  pidFilePath = dirsHandler.getLocalPathForWrite(pidFileSubpath);
{noformat}


> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When a docker container encounters an error and exits prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.





[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-24 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825404#comment-16825404
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang]
{quote}
The right logic is probably try to locate it first, if it is not found, then 
create a new path.
{quote}
I agree.  I think if we fix this, we won't need to change the cleanup logic.
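
A sketch of that "locate first, then create" idea (illustrative only; not the attached patch, though getLocalPathForRead/getLocalPathForWrite are the existing LocalDirsHandlerService calls):
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService;

final class PidFilePathSketch {
  static Path resolvePidFilePath(LocalDirsHandlerService dirsHandler,
      String pidFileSubpath) throws IOException {
    try {
      // Reuse the pid file path from the original launch if it already exists
      // in one of the local dirs.
      return dirsHandler.getLocalPathForRead(pidFileSubpath);
    } catch (IOException e) {
      // Otherwise allocate a fresh writable location, as ContainerLaunch does.
      return dirsHandler.getLocalPathForWrite(pidFileSubpath);
    }
  }
}
{code}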

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When a docker container encounters an error and exits prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove the container; instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.





[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-24 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825465#comment-16825465
 ] 

Jim Brennan commented on YARN-9486:
---

{quote}Patch 003 added the safe guard for missing pid file, and reverted the 
isLaunchCompleted logic. If IOException is thrown by disk health check, it will 
leave containers behind. Is that ok? I feel safer to check isLaunchCompleted 
flag to catch the corner cases, but I understand it may not be helpful in code 
readability.
{quote}
Yeah - really anything that throws before you actually call relaunchContainer() 
will put you in that state - the new call to getLocalPathForWrite() can throw 
IOException as well.
 I don't think it's ok to leave containers behind.

The only option I can think of other than adding the isLaunchCompleted check in 
ContainerCleanup would be to call markLaunched() when you catch an exception in 
ContainerRelaunch.call(). That's a little unexpected, so you'd need to add a 
comment to say we need to mark isLaunched in this case to ensure the original 
container is cleaned up.

My concern about the isLaunchCompleted check is that we always set that flag in the 
finally clause of ContainerLaunch.call(), so any failure before the 
launchContainer() call will now trigger a cleanup where it didn't before (for 
example, if we fail the areDisksHealthy() check, as you mentioned for the relaunch case).
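
To make that alternative concrete, here is a rough sketch of what the catch block in 
ContainerRelaunch.call() might look like; the method and field names here are 
assumptions for illustration, not the actual patch:
{code:java}
// Hypothetical sketch of the alternative discussed above; method and field
// names are assumptions, not the actual YARN patch.
@Override
public Integer call() {
  try {
    // ... getLocalPathForWrite(), disk checks, relaunchContainer(), etc. ...
    return relaunchContainer();
  } catch (IOException e) {
    // Mark the launch as "launched" so ContainerCleanup does not take the
    // "not launched, no cleanup needed" path and leave the old docker
    // container behind.
    markLaunched();
    LOG.warn("Relaunch failed for " + containerId, e);
    return -1;
  }
}
{code}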

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch, 
> YARN-9486.003.patch
>
>
> When docker container encounters error and exit prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> 

[jira] [Commented] (YARN-7848) Force removal of docker containers that do not get removed on first try

2019-04-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814627#comment-16814627
 ] 

Jim Brennan commented on YARN-7848:
---

Thanks [~eyang] for the update!

Some comments:
 In get_docker_rm_command(), we are not setting the return code if the 
add_to_args() call for {{"-f"}} fails, and it seems like it would be cleaner to 
structure it as:
{noformat}
ret = add_to_args(args, DOCKER_RM_COMMAND);
if (ret != 0) {
  ret = BUFFER_TOO_SMALL;
  goto free_and_exit;
}
ret = add_to_args(args, "-f");
if (ret != 0) {
  ret = BUFFER_TOO_SMALL;
  goto free_and_exit;
}
ret = add_to_args(args, container_name);
if (ret != 0) {
  ret = BUFFER_TOO_SMALL;
  goto free_and_exit;
}
{noformat}

 (nit) In remove_docker_container(), if you wanted to minimize the changes, you 
could have kept {{start_index}} and just set {{args[1] = argv[start_index];}} 
after the if.

In remove_docker_container(), I don't think it is appropriate to use 
free_values(args).  The values are on the stack, not the heap.  You do want to 
do a free(args) to free the array of pointers you allocated.
I think you need to do this free in both the child and the parent.



> Force removal of docker containers that do not get removed on first try
> ---
>
> Key: YARN-7848
> URL: https://issues.apache.org/jira/browse/YARN-7848
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-7848.001.patch, YARN-7848.002.patch, 
> YARN-7848.003.patch
>
>
> After the addition of YARN-5366, containers will get removed after a certain 
> debug delay. However, this is a one-time effort. If the removal fails for 
> whatever reason, the container will persist. We need to add a mechanism for a 
> forced removal of those containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-25 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826101#comment-16826101
 ] 

Jim Brennan commented on YARN-9486:
---

{quote}
As the result, we need to check both markedLaunched and isLaunchCompleted to 
get a better picture if the contained failed to launch, still running, or has 
not started at all.
{quote}
[~eyang] Thanks again for the follow-up.   I agree that adding the 
isLaunchCompleted check is warranted to cover all cases.
It might be helpful to add a comment about the relaunch case, where 
containerAlreadyLaunched is false but isLaunchCompleted is true, which seems 
counter-intuitive.

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch, 
> YARN-9486.003.patch, YARN-9486.004.patch
>
>
> When docker container encounters error and exit prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-25 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826142#comment-16826142
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang], I am +1 (non-binding) on patch 004.

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch, 
> YARN-9486.003.patch, YARN-9486.004.patch
>
>
> When docker container encounters error and exit prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-25 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826198#comment-16826198
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang] thanks for updating the comment.  +1 (non-binding) on patch 005.

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch, 
> YARN-9486.003.patch, YARN-9486.004.patch, YARN-9486.005.patch
>
>
> When docker container encounters error and exit prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-22 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823309#comment-16823309
 ] 

Jim Brennan commented on YARN-9486:
---

{quote}
It looks like the problem is in the usage of compareAndSet(false, true);.
{code:java}
  /**
   * Marks the container to be launched only if it was not launched.
   *
   * @return true if successful; false otherwise.
   */
  boolean markLaunched() {
return containerAlreadyLaunched.compareAndSet(false, true);
  }{code}
This will return false if the actual value is not equal to expected value. The 
person who coded this is assuming it will return the value of 
containerAlreadyLaunched.
{quote}
This is why it is negating the return from markLaunched(). !markLaunched() 
will be false if containerAlreadyLaunched was false, and true if it was true. 
In either case, containerAlreadyLaunched will be true after this call. I was 
accounting for this in my comment above.
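
For reference, here is a tiny standalone demo (not YARN code) of the AtomicBoolean 
behavior being described:
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone illustration of the compareAndSet semantics discussed above.
public class MarkLaunchedDemo {
  public static void main(String[] args) {
    AtomicBoolean containerAlreadyLaunched = new AtomicBoolean(false);

    // First call: flag was false, compareAndSet(false, true) succeeds and
    // returns true, so the negation is false (alreadyLaunched = false).
    System.out.println(!containerAlreadyLaunched.compareAndSet(false, true)); // false

    // Second call: flag is now true, compareAndSet fails and returns false,
    // so the negation is true (alreadyLaunched = true).
    System.out.println(!containerAlreadyLaunched.compareAndSet(false, true)); // true
  }
}
{code}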

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When docker container encounters error and exit prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting localization status for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: 
> COMPLETE, Capability: , Diagnostics: ..., ExitStatus: 
> -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
> 2019-04-15 20:42:18,464 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1555111445937_0008_01_07]
> 2019-04-15 20:43:50,476 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Stopping container with container Id: container_1555111445937_0008_01_07
> {code}
> There is no docker rm command performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly

2019-04-22 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823189#comment-16823189
 ] 

Jim Brennan commented on YARN-9486:
---

[~eyang] I'm just trying to understand the logic here.

containerAlreadyLaunched is initialized as false.

In prepareForLaunch(), it is set to true.

In signalContainer() it is set to true.

So it will be true if we attempted a container launch, or if we have signaled 
it (presumably for killing).

In the containerCleanup thread, we currently have:
{noformat}
boolean alreadyLaunched = !launch.markLaunched();
if (!alreadyLaunched) {
  // skip
{noformat}
 
 Which will also set it to true. If it was previously false, then we skip, so 
either it was never launched, or it was signaled.

The patch adds this check:
{noformat}
boolean alreadyLaunched = !launch.markLaunched() ||
    launch.isLaunchCompleted();
if (!alreadyLaunched) {
  // skip
{noformat}
The completed flag is set after a container returns from launchContainer. So 
basically any container that has fully completed will set alreadyLaunched true 
here.

The part I am not following is how launch.isLaunchCompleted() can ever be true 
when containerAlreadyLaunched is false. That is the only case that is 
changing here.

In the current code, if containerAlreadyLaunched is false, then 
launch.markLaunched() will return true, so alreadyLaunched will be false and we 
will skip. And if containerAlreadyLaunched is true, then launch.markLaunched() 
will return false, so alreadyLaunched will be true, and we will not skip.

In the patch, if launch.isLaunchCompleted() returns false, then the behavior is 
unchanged.
 If launch.isLaunchCompleted() returns true, it will affect the case where 
containerAlreadyLaunched is false - setting alreadyLaunched to true instead of 
false, and we won't skip.

So the question remains: how is it that we can have isLaunchCompleted() return 
true while containerAlreadyLaunched is false?

> Docker container exited with failure does not get clean up correctly
> 
>
> Key: YARN-9486
> URL: https://issues.apache.org/jira/browse/YARN-9486
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-9486.001.patch, YARN-9486.002.patch
>
>
> When docker container encounters error and exit prematurely 
> (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we 
> get messages that look like this:
> {code}
> java.io.IOException: Could not find 
> nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid
>  in any of the directories
> 2019-04-15 20:42:16,454 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> RELAUNCHING to EXITED_WITH_FAILURE
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Cleaning up container container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,455 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup:
>  Container container_1555111445937_0008_01_07 not launched. No cleanup 
> needed to be done
> 2019-04-15 20:42:16,455 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase  
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1555111445937_0008
> CONTAINERID=container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>  Container container_1555111445937_0008_01_07 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
>  Removing container_1555111445937_0008_01_07 from application 
> application_1555111445937_0008
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Stopping resource-monitoring for container_1555111445937_0008_01_07
> 2019-04-15 20:42:16,458 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  Considering container container_1555111445937_0008_01_07 for 
> log-aggregation
> 2019-04-15 20:42:16,804 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_1555111445937_0008_01_07
> 2019-04-15 

[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime

2019-06-28 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875124#comment-16875124
 ] 

Jim Brennan commented on YARN-9560:
---

Thanks for all the updates [~ebadger]!  I am also +1 on patch 013 (non-binding).


> Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
> ---
>
> Key: YARN-9560
> URL: https://issues.apache.org/jira/browse/YARN-9560
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>  Labels: Docker
> Attachments: YARN-9560.001.patch, YARN-9560.002.patch, 
> YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, 
> YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, 
> YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, 
> YARN-9560.012.patch, YARN-9560.013.patch
>
>
> Since the new RuncContainerRuntime will be using a lot of the same code as 
> DockerLinuxContainerRuntime, it would be good to move a bunch of the 
> DockerLinuxContainerRuntime code up a level to an abstract class that both of 
> the runtimes can extend. 
> The new structure will look like:
> {noformat}
> OCIContainerRuntime (abstract class)
>   - DockerLinuxContainerRuntime
>   - RuncContainerRuntime
> {noformat}
> This JIRA should only change the structure of the code, not the actual 
> semantics



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-13 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906463#comment-16906463
 ] 

Jim Brennan commented on YARN-9442:
---

Thanks [~ebadger]!  I will put up a patch for 2.8.

 

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.9.3, 3.1.3, 3.2.2
>
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch, 
> YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9442) container working directory has group read permissions

2019-08-13 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9442:
--
Attachment: YARN-9442-branch-2.8.001.patch

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.9.3, 3.1.3, 3.2.2
>
> Attachments: YARN-9442-branch-2.8.001.patch, YARN-9442.001.patch, 
> YARN-9442.002.patch, YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-13 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906625#comment-16906625
 ] 

Jim Brennan commented on YARN-9442:
---

Thanks [~ebadger]!

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 2.9.3, 3.1.3, 3.2.2
>
> Attachments: YARN-9442-branch-2.8.001.patch, YARN-9442.001.patch, 
> YARN-9442.002.patch, YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9815) ReservationACLsTestBase fails with NPE

2019-09-10 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926851#comment-16926851
 ] 

Jim Brennan commented on YARN-9815:
---

{quote}
Ok sure. I thought to avoid having NULL so the problem does not show up again.
{quote}
In this case, {{reservationAcls}} is private in ReservationsACLsManager, so 
it's only accessed in two methods, the constructor and this checkAccess method. 
You could fix it in the constructor, if you want, by checking the return from 
aConf.getReservationAcls() and assigning Collections.emptyMap() if it's null.
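
A minimal sketch of that constructor-side guard; the variable names and the exact 
getReservationAcls() signature are assumptions for illustration, not the actual patch:
{code:java}
// Hypothetical sketch of the constructor-side guard; variable names and the
// exact getReservationAcls() signature are assumptions, not the actual patch.
Map<ReservationACL, AccessControlList> acls = aConf.getReservationAcls(queueName);
if (acls == null) {
  // Queue has no reservation ACLs configured; store an empty map so that
  // checkAccess() never dereferences null later.
  acls = Collections.emptyMap();
}
this.reservationAcls.put(queueName, acls);
{code}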



> ReservationACLsTestBase fails with NPE
> --
>
> Key: YARN-9815
> URL: https://issues.apache.org/jira/browse/YARN-9815
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: YARN-9805.001.patch, YARN-9815.001.patch
>
>
> Running ReservationACLsTestBase throws a NPE running the FairScheduler. Old 
> revisions back in 2016 also throw NPE.
> In the test case, QueueC does not have reserveACLs, so 
> ReservationsACLsManager would throw NPE when it tries to access the ACL on 
> line 82.
> I still could not find what was the first revision that caused this test case 
> to fail. I stopped at bbfaf3c2712c9ba82b0f8423bdeb314bf505a692 which was 
> working fine.
> I have OsX with java 1.8.0_201
>  
> {code:java}
> [ERROR] 
> testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
>   Time elapsed: 1.897 s  <<< ERROR![ERROR] 
> testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
>   Time elapsed: 1.897 s  <<< 
> ERROR!java.lang.NullPointerException:java.lang.NullPointerException at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ReservationsACLsManager.checkAccess(ReservationsACLsManager.java:83)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.checkReservationACLs(ClientRMService.java:1527)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1290)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:511)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:645)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
>  at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) 
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitReservation(ApplicationClientProtocolPBClientImpl.java:511)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.submitReservation(ReservationACLsTestBase.java:447)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.verifySubmitReservationSuccess(ReservationACLsTestBase.java:247)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.testApplicationACLs(ReservationACLsTestBase.java:125)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> 

[jira] [Commented] (YARN-9815) ReservationACLsTestBase fails with NPE

2019-09-09 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926099#comment-16926099
 ] 

Jim Brennan commented on YARN-9815:
---

[~ahussein], I think a better solution would be to just add a null check for 
acls in ReservationsACLsManager.checkAccess().
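
A minimal sketch of that null check; the surrounding method shape and the fallback 
behavior are assumptions for illustration, not the actual patch:
{code:java}
// Hypothetical sketch of the null guard in checkAccess(); the surrounding
// method shape and the fallback behavior are assumptions, not the patch.
Map<ReservationACL, AccessControlList> acls = reservationAcls.get(queueName);
if (acls == null || !acls.containsKey(reservationOperation)) {
  // No reservation ACLs configured for this queue/operation; skip the ACL
  // lookup instead of hitting an NPE. (Allow vs. deny here is a policy
  // decision for the real patch.)
  return true;
}
return acls.get(reservationOperation).isUserAllowed(callerUGI);
{code}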

 

> ReservationACLsTestBase fails with NPE
> --
>
> Key: YARN-9815
> URL: https://issues.apache.org/jira/browse/YARN-9815
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: YARN-9805.001.patch, YARN-9815.001.patch
>
>
> Running ReservationACLsTestBase throws a NPE running the FairScheduler. Old 
> revisions back in 2016 also throw NPE.
> In the test case, QueueC does not have reserveACLs, so 
> ReservationsACLsManager would throw NPE when it tries to access the ACL on 
> line 82.
> I still could not find what was the first revision that caused this test case 
> to fail. I stopped at bbfaf3c2712c9ba82b0f8423bdeb314bf505a692 which was 
> working fine.
> I have OsX with java 1.8.0_201
>  
> {code:java}
> [ERROR] 
> testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
>   Time elapsed: 1.897 s  <<< ERROR![ERROR] 
> testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
>   Time elapsed: 1.897 s  <<< 
> ERROR!java.lang.NullPointerException:java.lang.NullPointerException at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ReservationsACLsManager.checkAccess(ReservationsACLsManager.java:83)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.checkReservationACLs(ClientRMService.java:1527)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1290)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:511)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:645)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
>  at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) 
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitReservation(ApplicationClientProtocolPBClientImpl.java:511)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.submitReservation(ReservationACLsTestBase.java:447)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.verifySubmitReservationSuccess(ReservationACLsTestBase.java:247)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.testApplicationACLs(ReservationACLsTestBase.java:125)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> 

[jira] [Commented] (YARN-9815) ReservationACLsTestBase fails with NPE

2019-09-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927589#comment-16927589
 ] 

Jim Brennan commented on YARN-9815:
---

[~ahussein] I am +1 (non-binding) on patch 002.  [~eepayne] or [~ebadger], if 
you agree, can one of you commit this?

 

> ReservationACLsTestBase fails with NPE
> --
>
> Key: YARN-9815
> URL: https://issues.apache.org/jira/browse/YARN-9815
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: YARN-9805.001.patch, YARN-9815.001.patch, 
> YARN-9815.002.patch
>
>
> Running ReservationACLsTestBase throws a NPE running the FairScheduler. Old 
> revisions back in 2016 also throw NPE.
> In the test case, QueueC does not have reserveACLs, so 
> ReservationsACLsManager would throw NPE when it tries to access the ACL on 
> line 82.
> I still could not find what was the first revision that caused this test case 
> to fail. I stopped at bbfaf3c2712c9ba82b0f8423bdeb314bf505a692 which was 
> working fine.
> I have OsX with java 1.8.0_201
>  
> {code:java}
> [ERROR] 
> testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
>   Time elapsed: 1.897 s  <<< ERROR![ERROR] 
> testApplicationACLs[1](org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase)
>   Time elapsed: 1.897 s  <<< 
> ERROR!java.lang.NullPointerException:java.lang.NullPointerException at 
> org.apache.hadoop.yarn.server.resourcemanager.security.ReservationsACLsManager.checkAccess(ReservationsACLsManager.java:83)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.checkReservationACLs(ClientRMService.java:1527)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitReservation(ClientRMService.java:1290)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitReservation(ApplicationClientProtocolPBServiceImpl.java:511)
>  at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:645)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
>  at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) 
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitReservation(ApplicationClientProtocolPBClientImpl.java:511)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.submitReservation(ReservationACLsTestBase.java:447)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.verifySubmitReservationSuccess(ReservationACLsTestBase.java:247)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ReservationACLsTestBase.testApplicationACLs(ReservationACLsTestBase.java:125)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at 

[jira] [Commented] (YARN-9805) Fine-grained SchedulerNode synchronization

2019-09-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927594#comment-16927594
 ] 

Jim Brennan commented on YARN-9805:
---

[~ahussein], before getting into a detailed review of the changes, I think you 
need to provide more details about what motivated this change, and specifically 
why you think this approach is better than the existing code.  Changing the 
synchronization approach in key components of the system is tricky, and I don't 
think the community is likely to accept this type of change without making a 
convincing case for why it is better.

 

> Fine-grained SchedulerNode synchronization
> --
>
> Key: YARN-9805
> URL: https://issues.apache.org/jira/browse/YARN-9805
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: YARN-9805.001.patch, YARN-9805.002.patch, 
> YARN-9805.003.patch
>
>
> Yarn schedulerNode and RMNode are using synchronized methods on reading and 
> updating the resources.
> Instead, use read-write reentrant locks to provide fine-grained locking and 
> to avoid blocking concurrent reads.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9805) Fine-grained SchedulerNode synchronization

2019-09-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927609#comment-16927609
 ] 

Jim Brennan commented on YARN-9805:
---

[~ahussein] here's my initial impression of AutoCloseableRWLock. I am 
concerned that some of the methods, like release() and isLocked(), are adding 
logic to deal with the fact that we don't know whether we are operating on the 
read lock or the write lock. I think a better approach would be to have 
wrappers for the read lock and write lock that implement AutoCloseable, rather 
than trying to do it at the AutoCloseableRWLock level.
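
To illustrate the wrapper idea, here is a sketch of my suggestion; the names are 
illustrative and not taken from the patch:
{code:java}
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of a per-lock AutoCloseable wrapper; names are illustrative and not
// taken from the YARN-9805 patch.
public final class AutoCloseableLock implements AutoCloseable {
  private final Lock lock;

  private AutoCloseableLock(Lock lock) {
    this.lock = lock;
  }

  // Acquire the given lock and return a handle that releases it on close().
  public static AutoCloseableLock acquire(Lock lock) {
    lock.lock();
    return new AutoCloseableLock(lock);
  }

  @Override
  public void close() {
    lock.unlock();
  }

  public static void main(String[] args) {
    ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    // The caller picks readLock() or writeLock() explicitly, so the wrapper
    // never has to guess which one it is holding.
    try (AutoCloseableLock ignored = AutoCloseableLock.acquire(rwLock.readLock())) {
      // read-side critical section
    }
    try (AutoCloseableLock ignored = AutoCloseableLock.acquire(rwLock.writeLock())) {
      // write-side critical section
    }
  }
}
{code}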

 

> Fine-grained SchedulerNode synchronization
> --
>
> Key: YARN-9805
> URL: https://issues.apache.org/jira/browse/YARN-9805
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: YARN-9805.001.patch, YARN-9805.002.patch, 
> YARN-9805.003.patch
>
>
> Yarn schedulerNode and RMNode are using synchronized methods on reading and 
> updating the resources.
> Instead, use read-write reentrant locks to provide fine-grained locking and 
> to avoid blocking concurrent reads.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-07 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902072#comment-16902072
 ] 

Jim Brennan commented on YARN-9442:
---

Thanks [~ebadger].  The current patch no longer applies, so I will put up a new 
one (hopefully later today).

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9442) container working directory has group read permissions

2019-08-07 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9442:
--
Attachment: YARN-9442.003.patch

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch, 
> YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-07 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902405#comment-16902405
 ] 

Jim Brennan commented on YARN-9442:
---

[~ebadger], I've put up a new patch that applies to trunk.

 

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch, 
> YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8045) Reduce log output from container status calls

2019-08-06 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901432#comment-16901432
 ] 

Jim Brennan commented on YARN-8045:
---

The patch for 2.8 looks good to me.  +1 (non-binding)

> Reduce log output from container status calls
> -
>
> Key: YARN-8045
> URL: https://issues.apache.org/jira/browse/YARN-8045
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.0.4, 2.8.6, 2.9.3, 3.1.3
>
> Attachments: YARN-8045.001-branch-2.8.patch, YARN-8045.001.patch
>
>
> Each time a container's status is returned a log entry is produced in the NM 
> from {{ContainerManagerImpl}}. The container status includes the diagnostics 
> field for the container. If the diagnostics field contains an exception, it 
> can appear as if the exception is logged repeatedly every second. The 
> diagnostics message can also span many lines, which puts pressure on the logs 
> and makes it harder to read.
> For example:
> {code}
> 2018-03-17 22:01:11,632 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_e01_1521323860653_0001_01_05
> 2018-03-17 22:01:11,632 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_e01_1521323860653_0001_01_05, ExecutionType: GUARANTEED, State: 
> RUNNING, Capability: , Diagnostics: [2018-03-17 
> 22:01:00.675]Exception from container-launch.
> Container id: container_e01_1521323860653_0001_01_05
> Exit code: -1
> Exception message: 
> Shell ouput: 
> [2018-03-17 22:01:00.750]Diagnostic message from attempt :
> [2018-03-17 22:01:00.750]Container exited with a non-zero exit code -1.
> , ExitStatus: -1, IP: null, Host: null, ContainerSubState: SCHEDULED]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-06 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901443#comment-16901443
 ] 

Jim Brennan commented on YARN-9442:
---

[~eyang], [~ebadger], [~shaneku...@gmail.com], [~jeagles], any further comments 
on this?

 

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-06 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901473#comment-16901473
 ] 

Jim Brennan commented on YARN-9527:
---

We have been running with this patch on one of our large research clusters for 
about a month.  I scanned for this issue again today and there were no 
instances of it.  That is not definitive, but it is a good sign.  We also have 
not had any new problems reported as a result of this change.

I will continue to monitor our clusters for this.

[~ebadger], did you want to see if we can get some other reviewers for this 
patch?

 

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904109#comment-16904109
 ] 

Jim Brennan commented on YARN-9527:
---

Thanks [~eyang] and [~ebadger]!

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904132#comment-16904132
 ] 

Jim Brennan commented on YARN-9442:
---

I am actually testing out a change - [~ebadger] and I discussed offline why we 
need 0710 vs 0700 permissions.  I can't think of a reason why we need even 
execute-only group permissions.

So I'm testing out that change and will put up another patch shortly.

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch, 
> YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904142#comment-16904142
 ] 

Jim Brennan commented on YARN-9442:
---

[~eyang] read permission is needed for directory listing.  Execute permission 
allows the group to access files within the directory, provided the files 
themselves have appropriate permissions.

But I think all of the NM setup/access of the working directory is done as a 
privileged operation, so the group read permissions are not needed for that.
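
For anyone following along, here is a minimal java.nio.file sketch of that 
distinction (illustrative only - the path and class name are made up, and the 
NM actually sets the mode through the container-executor, not through this API):
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class WorkDirPermsDemo {
  public static void main(String[] args) throws IOException {
    Path workDir = Paths.get("/tmp/container_workdir_demo");  // hypothetical path
    Files.createDirectory(workDir);

    // 0710: owner rwx, group --x (traverse only, no listing), other none.
    Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rwx--x---");
    Files.setPosixFilePermissions(workDir, perms);

    // A member of the directory's group can open a file inside workDir if it
    // already knows the name, but listing the directory requires the group
    // read bit (i.e. 0750), which is what this JIRA removes.
    System.out.println(PosixFilePermissions.toString(
        Files.getPosixFilePermissions(workDir)));
  }
}
{code}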

 

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch, 
> YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9442) container working directory has group read permissions

2019-08-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904218#comment-16904218
 ] 

Jim Brennan commented on YARN-9442:
---

[~eyang], [~ebadger] thanks for the discussion.  I have tested on a test 
cluster with 0700 for the container working directory and that works just fine 
for running my test jobs.

However, I did some poking around in the source base and found one case that I 
think will break if we remove execute permissions - 
ContainerImpl.ResourceLocalizedWhileRunningTransition() is attempting to check 
whether a symbolic link exists in the working directory (for a localized 
resource). I don't think that exists() check will work without execute 
permissions on the container working directory. To actually create the link, we 
will need to use a privileged operation, so I don't think that part would be 
affected.

Given this case (and the potential for others like it), and the fact that 
DefaultContainerExecutor is using 0710, I think we should stick with 0710.
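
To illustrate why that exists() check needs the group execute bit, here is a 
small standalone sketch (the path is hypothetical and this is not the actual 
ContainerImpl code):
{code}
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SymlinkExistsDemo {
  public static void main(String[] args) {
    // Hypothetical path standing in for a localized-resource link inside the
    // container working directory.
    Path link = Paths.get("/tmp/container_workdir_demo/resource_link");

    // For a process running as a member of the directory's group (the NM),
    // this stat needs the group execute (search) bit on the working directory.
    // With 0700 the lookup is denied and exists() simply reports false, which
    // is why sticking with 0710 seems safer.
    boolean present = Files.exists(link, LinkOption.NOFOLLOW_LINKS);
    System.out.println("resource link present: " + present);
  }
}
{code}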

> container working directory has group read permissions
> --
>
> Key: YARN-9442
> URL: https://issues.apache.org/jira/browse/YARN-9442
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9442.001.patch, YARN-9442.002.patch, 
> YARN-9442.003.patch
>
>
> Container working directories are currently created with permissions 0750, 
> owned by the user and with the group set to the node manager group.
> Is there any reason why these directories need group read permissions?
> I have been testing with group read permissions removed and so far I haven't 
> encountered any problems.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9647) Docker launch fails when local-dirs or log-dirs is unhealthy.

2019-07-22 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890503#comment-16890503
 ] 

Jim Brennan commented on YARN-9647:
---

[~ebadger], [~eyang], [~magnum] I think I'm following the discussion and I 
agree with the problem analysis.
{quote}It's slightly more nuanced than this. If the lists don't match the 
container still could've failed because of an invalid mount. Basically if we 
get an invalid mount error then we need to figure out whether that invalid 
mount was in the original allowed-mounts lists in container-executor.cfg. If it 
was, then the error message should indicate a bad disk. Otherwise, the usual 
invalid mount error message should be fine.
{quote}
Do we need to maintain two lists? check_mount_permitted() already returns -1 
when normalize_mount() fails for the mount_src, before it even checks whether 
the mount is permitted. If the disk is bad, I think that is where it will fail, 
so I don't think we'll ever get to the point of checking whether it is 
permitted. Maybe we just need to change this error message:
{noformat}
fprintf(ERRORFILE, "Invalid docker mount '%s', realpath=%s\n", values[i], mount_src);
{noformat}
to
{noformat}
fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n",
        mount_src, values[i]);
{noformat}
Even better, pull the normalization of mount_src out of check_mount_permitted() 
and do it separately:
{noformat}
  char *normalized_path = normalize_mount(mount_src, 0);
  if (normalized_path == NULL) {
    fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n",
            mount_src, values[i]);
    ret = INVALID_DOCKER_MOUNT;
    goto free_and_exit;
  }
  permitted_rw = check_mount_permitted((const char **) permitted_rw_mounts, normalized_path);
  permitted_ro = check_mount_permitted((const char **) permitted_ro_mounts, normalized_path);
{noformat}
For paths coming from the NM (local dirs / log dirs), the NM should have 
already checked to ensure bad ones aren't in the list.

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -
>
> Key: YARN-9647
> URL: https://issues.apache.org/jira/browse/YARN-9647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.2
>Reporter: KWON BYUNGCHANG
>Priority: Major
> Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> if /data2 is unhealthy, docker launch fails  although container can use 
> /data1 as local-dir, log-dir 
> error message is below
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> root cause is that normalize_mounts() in docker-util.c return -1  because it 
> cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is 
> disk fault  at this point)
> however disk of nm local dirs and nm log dirs can fail at any time.
> docker launch should succeed if there are available local dirs and log dirs.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8045) Reduce log output from container status calls

2019-07-26 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893841#comment-16893841
 ] 

Jim Brennan commented on YARN-8045:
---

I would really like to see this backported to 2.8 - it looks like the patch 
will apply cleanly.

> Reduce log output from container status calls
> -
>
> Key: YARN-8045
> URL: https://issues.apache.org/jira/browse/YARN-8045
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8045.001.patch
>
>
> Each time a container's status is returned a log entry is produced in the NM 
> from {{ContainerManagerImpl}}. The container status includes the diagnostics 
> field for the container. If the diagnostics field contains an exception, it 
> can appear as if the exception is logged repeatedly every second. The 
> diagnostics message can also span many lines, which puts pressure on the logs 
> and makes it harder to read.
> For example:
> {code}
> 2018-03-17 22:01:11,632 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Getting container-status for container_e01_1521323860653_0001_01_05
> 2018-03-17 22:01:11,632 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Returning ContainerStatus: [ContainerId: 
> container_e01_1521323860653_0001_01_05, ExecutionType: GUARANTEED, State: 
> RUNNING, Capability: , Diagnostics: [2018-03-17 
> 22:01:00.675]Exception from container-launch.
> Container id: container_e01_1521323860653_0001_01_05
> Exit code: -1
> Exception message: 
> Shell ouput: 
> [2018-03-17 22:01:00.750]Diagnostic message from attempt :
> [2018-03-17 22:01:00.750]Container exited with a non-zero exit code -1.
> , ExitStatus: -1, IP: null, Host: null, ContainerSubState: SCHEDULED]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9846) Use Finer-Grain Synchronization in ResourceLocalizationService

2019-09-20 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934761#comment-16934761
 ] 

Jim Brennan commented on YARN-9846:
---

[~belugabehr] thanks for the patch, but can you provide some background on what 
motivated this change?    It's not clear to me that the new approach is 
actually better in this case.   In the handle() and cleanupPrivLocalizers() 
methods, you are now acquiring two locks instead of one.  And in 
processHeartbeat() we are no longer holding the privLocalizers lock while 
calling the localizer.processHeartbeat() - I'm not sure if that will break 
anything, but the localization code is pretty fragile so I'd be careful.

I personally find the refactoring of LocalizerTracker.handle() to be less 
readable than the original, but that may just be a style issue.
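
To make the processHeartbeat() concern concrete, here is a simplified sketch; 
the names are illustrative and not the actual ResourceLocalizationService 
fields or method signatures:
{code}
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the locking concern; Runner stands in for the
// per-container localizer runner object.
class LocalizerTrackerSketch {
  interface Runner { String processHeartbeat(String status); }

  private final Map<String, Runner> privLocalizers = new HashMap<>();

  // Original style: the heartbeat is handled while the map lock is held, so a
  // concurrent cleanupPrivLocalizers() cannot stop or remove the runner
  // mid-call.
  String heartbeatUnderLock(String id, String status) {
    synchronized (privLocalizers) {
      Runner r = privLocalizers.get(id);
      return r == null ? null : r.processHeartbeat(status);
    }
  }

  // Refactored style: the lock only covers the lookup, so processHeartbeat()
  // can now race with a cleanup that stops or removes the same runner.
  String heartbeatOutsideLock(String id, String status) {
    Runner r;
    synchronized (privLocalizers) {
      r = privLocalizers.get(id);
    }
    return r == null ? null : r.processHeartbeat(status);
  }
}
{code}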

> Use Finer-Grain Synchronization in ResourceLocalizationService
> --
>
> Key: YARN-9846
> URL: https://issues.apache.org/jira/browse/YARN-9846
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
> Attachments: YARN-9846.1.patch, YARN-9846.2.patch, YARN-9846.3.patch
>
>
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java#L788
> # Remove these synchronization blocks
> # Ensure {{recentlyCleanedLocalizers}} is thread safe



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933510#comment-16933510
 ] 

Jim Brennan commented on YARN-9844:
---

Here is the output from running the tests:
{noformat}
mvn test -DRunCapacitySchedulerPerfTests=true -Dtest=TestCapacitySchedulerPerf
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:2.10.0-SNAPSHOT
[WARNING] 
'dependencyManagement.dependencies.dependency.(groupId:artifactId:type:classifier)'
 must be unique: com.microsoft.azure:azure-storage:jar -> version 7.0.0 vs 
5.4.0 @ org.apache.hadoop:hadoop-project:2.10.0-SNAPSHOT, 
/Users/jbrennan02/git/apache-hadoop/hadoop-project/pom.xml, line 1175, column 19
[WARNING] 
[WARNING] It is highly recommended to fix these problems because they threaten 
the stability of your build.
[WARNING] 
[WARNING] For this reason, future Maven versions might no longer support 
building such malformed projects.
[WARNING] 
[INFO] 
[INFO] 
[INFO] Building Apache Hadoop YARN ResourceManager 2.10.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-antrun-plugin:1.7:run (create-testdirs) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Executing tasks


main:
[INFO] Executed tasks
[INFO] 
[INFO] --- hadoop-maven-plugins:2.10.0-SNAPSHOT:protoc (compile-protoc) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Wrote protoc checksums to file 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-maven-plugins-protoc-checksums.json
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/resources
[INFO] Copying 2 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Compiling 3 source files to 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/classes
[INFO] 
[INFO] --- hadoop-maven-plugins:2.10.0-SNAPSHOT:test-protoc 
(compile-test-protoc) @ hadoop-yarn-server-resourcemanager ---
[INFO] Wrote protoc checksums to file 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-maven-plugins-protoc-checksums.json
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 11 resources
[INFO] Copying 1 resource
[INFO] Copying 2 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Compiling 1 source file to 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-classes
[INFO] 
[INFO] --- maven-jar-plugin:2.5:test-jar (default) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] Building jar: 
/Users/jbrennan02/git/apache-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-yarn-server-resourcemanager-2.10.0-SNAPSHOT-tests.jar
[INFO] 
[INFO] --- maven-surefire-plugin:2.21.0:test (default-test) @ 
hadoop-yarn-server-resourcemanager ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
[ERROR] Tests run: 4, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 42.365 
s <<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf
[ERROR] 
testUserLimitThroughputForFiveResources(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerPerf)
  Time elapsed: 0.038 s  <<< ERROR!
java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:241)
at 
org.apache.hadoop.yarn.api.records.Resource.setResourceValue(Resource.java:351)
at 
org.apache.hadoop.yarn.util.resource.ResourceUtils.getResourceTypesMinimumAllocation(ResourceUtils.java:534)
at 

[jira] [Commented] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933536#comment-16933536
 ] 

Jim Brennan commented on YARN-9844:
---

It appears that I can run these tests individually with no failures.  They only 
fail when I run the full test class.  Is it possible these tests are being run 
in parallel?

 

> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)
Jim Brennan created YARN-9844:
-

 Summary: TestCapacitySchedulerPerf test errors in branch-2
 Key: YARN-9844
 URL: https://issues.apache.org/jira/browse/YARN-9844
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Affects Versions: 2.10.0
Reporter: Jim Brennan


**These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9844:
--
Description: 
These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}
{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}
{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}

  was:
**These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}{{[ERROR]   
TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
 » ArrayIndexOutOfBounds}}


> TestCapacitySchedulerPerf test errors in branch-2
> -
>
> Key: YARN-9844
> URL: https://issues.apache.org/jira/browse/YARN-9844
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Priority: Major
>
> These TestCapacitySchedulerPerf throughput tests are failing in branch-2:
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}
> {{[ERROR]   
> TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114
>  » ArrayIndexOutOfBounds}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9564) Create docker-to-squash tool for image conversion

2019-11-01 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965038#comment-16965038
 ] 

Jim Brennan commented on YARN-9564:
---

Based on my testing with the latest patches for YARN-9561, YARN-9562, and this 
patch, I am +1 (non-binding) on patch 004.  I was able to use docker2squash.py 
to pull a docker image, squash it, and push the layers to my local hdfs.  I was 
then able to run some test jobs using the runc container runtime.

 

> Create docker-to-squash tool for image conversion
> -
>
> Key: YARN-9564
> URL: https://issues.apache.org/jira/browse/YARN-9564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9564.001.patch, YARN-9564.002.patch, 
> YARN-9564.003.patch, YARN-9564.004.patch
>
>
> The new runc runtime uses docker images that are converted into multiple 
> squashfs images. Each layer of the docker image will get its own squashfs 
> image. We need a tool to help automate the creation of these squashfs images 
> when all we have is a docker image



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-01 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965048#comment-16965048
 ] 

Jim Brennan commented on YARN-9562:
---

Thanks for the updates [~ebadger]!  I am +1 (non-binding) on patch 013.   I 
tested it with the patches for YARN-9561 and YARN-9564.  I was able to run with 
the runc container executor on a one node cluster.  I verified that I could use 
the {{YARN_CONTAINER_RUNTIME_RUNC_MOUNTS}} environment variable to specify the 
mounts.   I also ran all of the relevant unit tests.

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-01 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965061#comment-16965061
 ] 

Jim Brennan commented on YARN-9561:
---

Thanks for updating the patch [~ebadger]!  I tested this along with the patches 
for YARN-9562 and YARN-9564.  Everything seems to be working well.   I did run 
into one issue with the container executor unit tests (cetest).  I normally 
compile with this option: -Dcontainer-executor.conf.dir=${HADOOP_CONF_DIR}

This causes some failures in cetest:
{noformat}
[--] 7 tests from TestRunc
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_valid
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed


[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_valid (1 ms)
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_bad_container_id
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed


[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_bad_container_id (0 ms)
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_existing_pidfile
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed


[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_existing_pidfile (0 ms)
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_invalid_media_type
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed


[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_invalid_media_type (0 ms)
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_invalid_num_reap_layers_keep
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed


[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_invalid_num_reap_layers_keep 
(0 ms)
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_valid_mounts
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed

[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_valid_mounts (0 ms)
[ RUN      ] TestRunc.test_parse_runc_launch_cmd_invalid_mounts
Could not create /home/gs/hadoop/conf/container-executor.cfg
/home/jbrennan02/git/y-hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/utils/test_runc_util.cc:47:
 Failure
      Expected: ret
      Which is: 1
To be equal to: 0
Container executor cfg setup failed


[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_invalid_mounts (0 ms)
[--] 7 tests from TestRunc (2 ms total)

[--] Global test environment tear-down
[==] 89 tests from 10 test cases ran. (82 ms total)
[  PASSED  ] 82 tests.
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_valid
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_bad_container_id
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_existing_pidfile
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_invalid_media_type
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_invalid_num_reap_layers_keep
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_valid_mounts
[  FAILED  ] TestRunc.test_parse_runc_launch_cmd_invalid_mounts
{noformat}
The tests are failing because I already have a container-executor.cfg there, 
and it is owned by root.
If I run without defining {{container-executor.conf.dir}}, all of the tests 
pass.
I was able to get this to work by modifying 
test_runc_util.cc::create_ce_file():
{noformat}
        int create_ce_file() {
         

[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-01 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965064#comment-16965064
 ] 

Jim Brennan commented on YARN-9561:
---

Actually, it might be better in this case to just do the stat and fail if it 
doesn't exist.

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966954#comment-16966954
 ] 

Jim Brennan commented on YARN-9561:
---

Thanks for fixing that [~ebadger]!  The change looks good to me, and I verified 
that it works for me with container-executor.conf.dir set.

I am +1 (non-binding) on patch 009.

 

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9564) Create docker-to-squash tool for image conversion

2019-11-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970382#comment-16970382
 ] 

Jim Brennan commented on YARN-9564:
---

Thanks for the updates [~ebadger]!  I am +1 (non-binding) on patch 006.

 

> Create docker-to-squash tool for image conversion
> -
>
> Key: YARN-9564
> URL: https://issues.apache.org/jira/browse/YARN-9564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9564.001.patch, YARN-9564.002.patch, 
> YARN-9564.003.patch, YARN-9564.004.patch, YARN-9564.005.patch, 
> YARN-9564.006.patch
>
>
> The new runc runtime uses docker images that are converted into multiple 
> squashfs images. Each layer of the docker image will get its own squashfs 
> image. We need a tool to help automate the creation of these squashfs images 
> when all we have is a docker image



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970381#comment-16970381
 ] 

Jim Brennan commented on YARN-9562:
---

Thanks for the updates [~ebadger]!  I am +1 (non-binding) on patch 014.

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch, YARN-9562.014.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-08 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970380#comment-16970380
 ] 

Jim Brennan commented on YARN-9561:
---

Thanks for the updates [~ebadger]!

A couple comments on the new patch:
 * stat_file_as_nm should ensure that it restores the calling user/group before 
returning.
 * (nit) We might want to change the name of stat_file_as_nm - it is not at all 
clear from the name that it fails if the file exists and succeeds if it 
doesn't.

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974650#comment-16974650
 ] 

Jim Brennan commented on YARN-9562:
---

[~ebadger] I don't see a patch 015...

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch, YARN-9562.014.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975247#comment-16975247
 ] 

Jim Brennan commented on YARN-9561:
---

I'm +1 (non-binding) on patch 014.

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch, 
> YARN-9561.012.patch, YARN-9561.013.patch, YARN-9561.014.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975241#comment-16975241
 ] 

Jim Brennan commented on YARN-9562:
---

I'm +1 on patch 015 (non-binding)

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch, YARN-9562.014.patch, 
> YARN-9562.015.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9959) Work around hard-coded tmp and /var/tmp bind-mounts in the container's working directory

2019-11-15 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975253#comment-16975253
 ] 

Jim Brennan commented on YARN-9959:
---

Sounds reasonable to me as well, but we may want to choose a format other than 
{{%WORK_DIR%}}, as this format looks like a Windows environment variable.

 

> Work around hard-coded tmp and /var/tmp bind-mounts in the container's 
> working directory
> 
>
> Key: YARN-9959
> URL: https://issues.apache.org/jira/browse/YARN-9959
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Priority: Major
>
> {noformat}
> addRuncMountLocation(mounts, containerWorkDir.toString() +
> "/private_slash_tmp", "/tmp", true, true);
> addRuncMountLocation(mounts, containerWorkDir.toString() +
> "/private_var_slash_tmp", "/var/tmp", true, true);
> {noformat}
> It would be good to remove the hard-coded tmp mounts from the 
> {{RuncContainerRuntime}} in place of something general or possibly a tmpfs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-11-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968537#comment-16968537
 ] 

Jim Brennan commented on YARN-9562:
---

[~ebadger], [~shaneku...@gmail.com] for the record, I ran with 
{{linux-container-executor.nonsecure-mode.limit-users=false}}

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch, YARN-9562.011.patch, 
> YARN-9562.012.patch, YARN-9562.013.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961090#comment-16961090
 ] 

Jim Brennan commented on YARN-9914:
---

Thanks [~ebadger]!  I was just about to put up another patch to fix those 
checkstyle warnings but you beat me to it.

 

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.8.6, 2.9.3, 3.2.2, 3.1.4, 2.11.0
>
> Attachments: YARN-9914-branch-2.8.001.patch, YARN-9914.001.patch, 
> YARN-9914.002.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.
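
A minimal sketch of the hysteresis described above (hypothetical class and 
field names, not the actual DirectoryCollection code):
{code}
// A disk is marked full when free space drops below minFreeSpaceMb and is only
// marked good again once it rises above the high watermark, so the state does
// not oscillate right at a single threshold.
class DiskFullnessCheck {
  private final long minFreeSpaceMb;   // e.g. 5 * 1024 MB
  private final long highWatermarkMb;  // e.g. 10 * 1024 MB
  private boolean full = false;

  DiskFullnessCheck(long minFreeSpaceMb, long highWatermarkMb) {
    this.minFreeSpaceMb = minFreeSpaceMb;
    this.highWatermarkMb = highWatermarkMb;
  }

  boolean isDiskFull(long freeSpaceMb) {
    if (!full && freeSpaceMb < minFreeSpaceMb) {
      full = true;                     // good disk drops below the low threshold
    } else if (full && freeSpaceMb > highWatermarkMb) {
      full = false;                    // full disk recovers above the watermark
    }
    return full;
  }
}
{code}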



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-10-28 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961150#comment-16961150
 ] 

Jim Brennan commented on YARN-9562:
---

[~ebadger], I pulled YARN-9561, YARN-9562, and YARN-9564 into a local branch to 
verify that I could setup and run with the RUNC run-time.
 Overall, things went well, but I did run into a few issues that I will note 
here:
 * In container-executor.cfg, {{module-enabled=true}} in the runc section does 
not work.  You must specify {{feature.runc.enabled=true}}.  I think the 
is_runc_support_enabled() check should behave like is_docker_support_enabled() 
and check for module.enabled as well.
 * We need to document the need for the 
{{image-tag-to-manifest-plugin.hdfs-hash-file}} property somewhere - you don't 
get very far without it.
 * Do we not support specifying mounts via environment variables as docker 
does?  I think people will expect that.  I originally tried using 
{{-Dmapreduce.map.env.YARN_CONTAINER_RUNTIME_RUNC_MOUNTS=...}} before realizing 
that I needed to specify default ro/rw mounts in yarn-site.xml.

Once I got past these issues, I was able to run jobs using the RunC container 
run-time on a one-node cluster.

Thanks for all the hard work on this feature!

 

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch, YARN-9562.010.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9906) When setting multi volumes throurh the "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" setting is not valid

2019-10-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-9906.
---
Resolution: Invalid

> When setting multi volumes throurh the "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" 
> setting is not  valid
> ---
>
> Key: YARN-9906
> URL: https://issues.apache.org/jira/browse/YARN-9906
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: lynn
>Priority: Major
> Attachments: docker_volume_mounts.patch
>
>
> As 
> [https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html#Application_Submission]
>  described, when I set the item "{{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" to 
> multi volumes mounts, the value is a comma-separated list of mounts.}}
>  
> {quote}vars="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,
>  
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro;/etc/hadoop/conf:/etc/hadoop/conf"
>  hadoop jar hadoop-examples.jar pi -Dyarn.app.mapreduce.am.env=$vars \
>  -Dmapreduce.map.env=$vars -Dmapreduce.reduce.env=$vars 10 100{quote}
> I found the docker container can mount the first volume, so it can't be 
> running successfully without report error!
> The code of 
> [DockerLinuxContainerRuntime.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java]
>  as follows:
> {quote}if (environment.containsKey(ENV_DOCKER_CONTAINER_MOUNTS)) {
>   Matcher parsedMounts = USER_MOUNT_PATTERN.matcher(
>   environment.get(ENV_DOCKER_CONTAINER_MOUNTS));
>   if (!parsedMounts.find()) {
> throw new ContainerExecutionException(
> "Unable to parse user supplied mount list: "
> + environment.get(ENV_DOCKER_CONTAINER_MOUNTS));
>   }{quote}
> The regex pattern is in 
> [OCIContainerRuntime|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/OCIContainerRuntime.java]
>  as follows
> {quote}static final Pattern USER_MOUNT_PATTERN = Pattern.compile(
>   "(?<=^|,)([^:\\x00]+):([^:\\x00]+)" +
>   "(:(r[ow]|(r[ow][+])?(r?shared|r?slave|r?private)))?(?:,|$)");{quote}
> It is indeed separated by commas, but when I read the code that submits the 
> jar to YARN, I found the code in 
> [Apps.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java]
> {quote}private static final Pattern VARVAL_SPLITTER = Pattern.compile(
> "(?<=^|,)"// preceded by ',' or line begin
>   + '(' + Shell.ENV_NAME_REGEX + ')'  // var group
>   + '='
>   + "([^,]*)" // val group
>   );
> {quote}
> It is separated by commas in the same way.
> So, I just changed the separator of {{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS}} 
> from comma to semicolon (";").



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9906) When setting multiple volumes through the "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" setting is not valid

2019-10-16 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952852#comment-16952852
 ] 

Jim Brennan commented on YARN-9906:
---

[~lynnyuan] please see YARN-8071.   You can specify environment variables 
singly, for example:
{noformat}
-Dmapreduce.map.env.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro{noformat}
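
For background, here is a minimal standalone sketch (my own illustration, not the 
actual YARN code path; Shell.ENV_NAME_REGEX is approximated as a plain variable 
name) of why the combined {{-Dmapreduce.map.env=$vars}} form drops everything 
after the first comma inside the mount list:
{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MountSplitDemo {
  // Same shape as the VARVAL_SPLITTER quoted below; it splits var=val pairs
  // on commas, so a comma inside a value is treated as a pair separator.
  private static final Pattern VARVAL_SPLITTER = Pattern.compile(
      "(?<=^|,)([A-Za-z_][A-Za-z0-9_]*)=([^,]*)");

  public static void main(String[] args) {
    String vars = "YARN_CONTAINER_RUNTIME_TYPE=docker,"
        + "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="
        + "/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro";
    Matcher m = VARVAL_SPLITTER.matcher(vars);
    while (m.find()) {
      // The mounts variable prints as only "/etc/passwd:/etc/passwd:ro";
      // the second mount after the embedded comma is silently dropped.
      System.out.println(m.group(1) + " -> " + m.group(2));
    }
  }
}
{code}
The per-variable form above side-steps that splitter, so commas inside the mount 
list are preserved.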

> When setting multiple volumes through the "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" 
> setting is not valid
> ---
>
> Key: YARN-9906
> URL: https://issues.apache.org/jira/browse/YARN-9906
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: lynn
>Priority: Major
> Attachments: docker_volume_mounts.patch
>
>
> As 
> [https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html#Application_Submission]
>  describes, when I set {{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS}} to multiple 
> volume mounts, the value is a comma-separated list of mounts.
>  
> {quote}vars="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,
>  
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro;/etc/hadoop/conf:/etc/hadoop/conf"
>  hadoop jar hadoop-examples.jar pi -Dyarn.app.mapreduce.am.env=$vars \
>  -Dmapreduce.map.env=$vars -Dmapreduce.reduce.env=$vars 10 100{quote}
> I found the docker container only mounts the first volume, so it cannot run 
> successfully, and no error is reported.
> The code of 
> [DockerLinuxContainerRuntime.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java]
>  as follows:
> {quote}if (environment.containsKey(ENV_DOCKER_CONTAINER_MOUNTS)) {
>   Matcher parsedMounts = USER_MOUNT_PATTERN.matcher(
>   environment.get(ENV_DOCKER_CONTAINER_MOUNTS));
>   if (!parsedMounts.find()) {
> throw new ContainerExecutionException(
> "Unable to parse user supplied mount list: "
> + environment.get(ENV_DOCKER_CONTAINER_MOUNTS));
>   }{quote}
> The regex pattern is in 
> [OCIContainerRuntime|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/OCIContainerRuntime.java]
>  as follows
> {quote}static final Pattern USER_MOUNT_PATTERN = Pattern.compile(
>   "(?<=^|,)([^:\\x00]+):([^:\\x00]+)" +
>   "(:(r[ow]|(r[ow][+])?(r?shared|r?slave|r?private)))?(?:,|$)");{quote}
> It is indeed separated by commas, but when I read the code that submits the 
> jar to YARN, I found the code in 
> [Apps.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java]
> {quote}private static final Pattern VARVAL_SPLITTER = Pattern.compile(
> "(?<=^|,)"// preceded by ',' or line begin
>   + '(' + Shell.ENV_NAME_REGEX + ')'  // var group
>   + '='
>   + "([^,]*)" // val group
>   );
> {quote}
> It is separated by commas in the same way.
> So, I just changed the separator of {{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS}} 
> from comma to semicolon (";").



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-10-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959002#comment-16959002
 ] 

Jim Brennan commented on YARN-9561:
---

[~ebadger] it does not look like Hadoop QA tests have run since you updated 
patch 005?

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958982#comment-16958982
 ] 

Jim Brennan commented on YARN-9914:
---

[~epayne] or [~ebadger] can you please review?

 

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9914.001.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-10-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959191#comment-16959191
 ] 

Jim Brennan commented on YARN-9561:
---

Thanks for the updated patch [~ebadger]!

I had a few comments on patch 005:

runc_write_config()
 *  add_std_mounts_json() - should we increase shm to 8GB like we did 
internally?

main.c
 - [BUG] main() - missing {{break}} statement after 
{{RUN_AS_USER_SYNC_YARN_SYSFS}} block.

test_string_utils.cc
 - TEST_F(TestStringUtils, test_strbuf_detach)
this test would be a little better if it moved the buf contents assert to after 
the third append format and/or included a check that sb.buffer != buf (after 
that last append).
 - TEST_F(TestStringUtils, test_strbuf_realloc)
looks like this has some leftover debug std::cout lines?

test_runc_util.cc
 - build_process_struct()
returns true when it fails and false when it succeeds?
 - build_mounts_json()
 is there a reason for three options string arrays?  The first two are 
identical.  Can’t you just have options_ro and options_rw?
 - (nit) unindented line: {{remove(pid_file);}}
 - Would be nice to move these lines into a function - they are repeated a lot:

{noformat}
std::string container_executor_cfg_contents = "[runc]\n  "
    "runc.allowed.rw-mounts=/opt,/var,/usr/bin/cut,/usr/bin/awk\n  "
    "runc.allowed.ro-mounts=/etc/passwd";

ret = setup_container_executor_cfg(container_executor_cfg_contents);
ASSERT_EQ(ret, 0) << "Container executor cfg setup failed\n";
{noformat}

 

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959256#comment-16959256
 ] 

Jim Brennan commented on YARN-9914:
---

Thanks for the review [~ebadger]!  I've uploaded patch 002, which renames that 
local variable and some other private/local variables to try to make them a 
little clearer (e.g., diskFreeSpaceCutoff).

 

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9914.001.patch, YARN-9914.002.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-24 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9914:
--
Attachment: YARN-9914.002.patch

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9914.001.patch, YARN-9914.002.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime

2019-10-24 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959220#comment-16959220
 ] 

Jim Brennan commented on YARN-9562:
---

Comment on patch 009:

NodeManager.java
 * serviceInit() looks like this line is being re-added (so now it is 
duplicated):

{noformat}
((NMContext) context).setContainerExecutor(exec);{noformat}

> Add Java changes for the new RuncContainerRuntime
> -
>
> Key: YARN-9562
> URL: https://issues.apache.org/jira/browse/YARN-9562
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9562.001.patch, YARN-9562.002.patch, 
> YARN-9562.003.patch, YARN-9562.004.patch, YARN-9562.005.patch, 
> YARN-9562.006.patch, YARN-9562.007.patch, YARN-9562.008.patch, 
> YARN-9562.009.patch
>
>
> This JIRA will be used to add the Java changes for the new 
> RuncContainerRuntime. This will work off of YARN-9560 to use much of the 
> existing DockerLinuxContainerRuntime code once it is moved up into an 
> abstract class that can be extended. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-18 Thread Jim Brennan (Jira)
Jim Brennan created YARN-9914:
-

 Summary: Use separate configs for free disk space checking for 
full and not-full disks
 Key: YARN-9914
 URL: https://issues.apache.org/jira/browse/YARN-9914
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan
Assignee: Jim Brennan


[YARN-3943] added separate configurations for the nodemanager health check disk 
utilization full disk check:

{{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
disk full

{{disk-utilization-watermark-low-per-disk-percentage}} - threshold for marking 
a full disk as not full.

On our clusters, we do not use these configs. We instead use 
{{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
percent of utilization. We have observed the same oscillation behavior as 
described in [YARN-3943] with this parameter. I would like to add an optional 
config to specify a separate threshold for marking a full disk as not full:

{{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full

{{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full disk 
is marked good.

So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which would 
cause a disk to be marked full when free space goes below 5GB, and 
{{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
full state until free space goes above 10GB.
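
To make the proposal concrete, here is a yarn-site.xml sketch of how I would 
expect the pair to be used (I am assuming the new property lands under the same 
yarn.nodemanager.disk-health-checker prefix as the existing one; the final name 
may differ):
{noformat}
<property>
  <name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name>
  <value>5120</value>   <!-- mark a good disk full below ~5GB free -->
</property>
<property>
  <name>yarn.nodemanager.disk-health-checker.disk-free-space-per-disk-high-watermark-mb</name>
  <value>10240</value>  <!-- mark a full disk good again above ~10GB free -->
</property>
{noformat}
This is the usual hysteresis pattern: two thresholds keep a disk from flapping 
between full and not-full when free space hovers near a single cutoff.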



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-18 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9914:
--
Attachment: YARN-9914.001.patch

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-9914.001.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9884) Make container-executor mount logic modular

2019-10-17 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953870#comment-16953870
 ] 

Jim Brennan commented on YARN-9884:
---

Thanks for the update [~ebadger]! I am +1 on patch 004 (non-binding).

 

> Make container-executor mount logic modular
> ---
>
> Key: YARN-9884
> URL: https://issues.apache.org/jira/browse/YARN-9884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9884.001.patch, YARN-9884.002.patch, 
> YARN-9884.003.patch, YARN-9884.004.patch
>
>
> The current mount logic in the container-executor is interwined with docker. 
> To avoid duplicating code between docker and runc, the code should be 
> refactored so that both runtimes can use the same common code when possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960057#comment-16960057
 ] 

Jim Brennan commented on YARN-9914:
---

Thanks [~ebadger]! I've attached a patch for branch-2.8.

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.9.3, 3.2.2, 3.1.4, 2.11.0
>
> Attachments: YARN-9914-branch-2.8.001.patch, YARN-9914.001.patch, 
> YARN-9914.002.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-25 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-9914:
--
Attachment: YARN-9914-branch-2.8.001.patch

> Use separate configs for free disk space checking for full and not-full disks
> -
>
> Key: YARN-9914
> URL: https://issues.apache.org/jira/browse/YARN-9914
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Fix For: 2.10.0, 3.0.4, 3.3.0, 2.9.3, 3.2.2, 3.1.4, 2.11.0
>
> Attachments: YARN-9914-branch-2.8.001.patch, YARN-9914.001.patch, 
> YARN-9914.002.patch
>
>
> [YARN-3943] added separate configurations for the nodemanager health check 
> disk utilization full disk check:
> {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
> disk full
> {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for 
> marking a full disk as not full.
> On our clusters, we do not use these configs. We instead use 
> {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
> percent of utilization. We have observed the same oscillation behavior as 
> described in [YARN-3943] with this parameter. I would like to add an optional 
> config to specify a separate threshold for marking a full disk as not full:
> {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full
> {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full 
> disk is marked good.
> So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which 
> would cause a disk to be marked full when free space goes below 5GB, and 
> {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
> full state until free space goes above 10GB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-11-26 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982720#comment-16982720
 ] 

Jim Brennan commented on YARN-9561:
---

Thanks for the update [~ebadger]!  I downloaded patch 015 and then verified I 
could build nodemanager after doing a clean at the top level.  I then ran 
cetest and test-container-executor.  I then verified I could build from the top 
level as well.

+1 (non-binding) on patch 015.

 

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch, 
> YARN-9561.012.patch, YARN-9561.013.patch, YARN-9561.014.patch, 
> YARN-9561.015.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9730) Support forcing configured partitions to be exclusive based on app node label

2019-09-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938009#comment-16938009
 ] 

Jim Brennan commented on YARN-9730:
---

[~jhung] I believe pulling this back to branch-2 has caused failures in 
TestAppManager (and others).  Example stack trace:
{noformat}
[ERROR] Tests run: 21, Failures: 0, Errors: 7, Skipped: 0, Time elapsed: 7.216 
s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestAppManager
[ERROR] 
testRMAppRetireZeroSetting(org.apache.hadoop.yarn.server.resourcemanager.TestAppManager)
  Time elapsed: 0.054 s  <<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.RMContextImpl.getExclusiveEnforcedPartitions(RMContextImpl.java:590)
at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.<init>(RMAppManager.java:115)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestAppManager$TestRMAppManager.<init>(TestAppManager.java:192)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestAppManager.testRMAppRetireZeroSetting(TestAppManager.java:450)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413)
{noformat}

> Support forcing configured partitions to be exclusive based on app node label
> -
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>  Labels: release-blocker
> Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.patch, 
> YARN-9730.002.patch, YARN-9730.003.patch
>
>
> Use case: queue X has all of its workload in non-default (exclusive) 
> partition P (by setting app submission context's node label set to P). Node 
> in partition Q != P heartbeats to RM. Capacity scheduler loops through every 
> application in X, and every scheduler key in this application, and fails to 
> allocate each time since the app's requested label and the node's label don't 
> match. This causes huge performance degradation when number of apps in X is 
> large.
> To fix the issue, allow RM to configure partitions as "forced-exclusive". If 
> partition P is "forced-exclusive", then:
>  * 1a. If app sets its submission context's node label to P, all its resource 
> requests will be overridden to P
>  * 1b. If app sets its submission context's node label Q, any of its resource 
> requests whose labels are 

[jira] [Commented] (YARN-9857) TestDelegationTokenRenewer throws NPE but tests pass

2019-09-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938023#comment-16938023
 ] 

Jim Brennan commented on YARN-9857:
---

+1 This looks good to me. (non-binding)

[~ebadger] what do you think?

 

> TestDelegationTokenRenewer throws NPE but tests pass
> 
>
> Key: YARN-9857
> URL: https://issues.apache.org/jira/browse/YARN-9857
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Minor
> Attachments: YARN-9857.001.patch
>
>
> {{TestDelegationTokenRenewer}} throws some NPEs:
> {code:bash}
> 2019-09-25 12:51:23,446 WARN  [pool-19-thread-2] 
> security.DelegationTokenRenewer 
> (DelegationTokenRenewer.java:handleDTRenewerAppSubmitEvent(945)) - Unable to 
> add the application to the delegation token renewer.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:942)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:918)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2019-09-25 12:51:23,446 DEBUG [main] util.MBeans 
> (MBeans.java:unregister(138)) - Unregistering 
> Hadoop:service=ResourceManager,name=CapacitySchedulerMetrics
> Exception in thread "pool-19-thread-2" java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:951)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:918)
> 2019-09-25 12:51:23,447 DEBUG [main] util.MBeans 
> (MBeans.java:unregister(138)) - Unregistering 
> Hadoop:service=ResourceManager,name=MetricsSystem,sub=Stats
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2019-09-25 12:51:23,447 INFO  [main] impl.MetricsSystemImpl 
> (MetricsSystemImpl.java:stop(216)) - ResourceManager metrics system stopped.
> {code}
> the RMContext dispatcher is not set for the RMMock which results in NPE 
> accessing the event handler of the dispatcher inside 
> {{DelegationTokenRenewer}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9884) Make container-executor mount logic modular

2019-10-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949691#comment-16949691
 ] 

Jim Brennan commented on YARN-9884:
---

[~ebadger] good job on the refactoring.  This looks pretty good to me.  I was 
going to comment that there are a few of the DOCKER related enum values that 
are no longer used, like INVALID_DOCKER_RO_MOUNT, and those should be removed.  
Also, I think all DOCKER-specific codes should have DOCKER in the name.

I agree with [~eyang] that a single list would be even better.

 

 

> Make container-executor mount logic modular
> ---
>
> Key: YARN-9884
> URL: https://issues.apache.org/jira/browse/YARN-9884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9884.001.patch, YARN-9884.002.patch
>
>
> The current mount logic in the container-executor is interwined with docker. 
> To avoid duplicating code between docker and runc, the code should be 
> refactored so that both runtimes can use the same common code when possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9884) Make container-executor mount logic modular

2019-10-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951032#comment-16951032
 ] 

Jim Brennan commented on YARN-9884:
---

Thanks for the updates [~ebadger]! I am +1 (non-binding) on patch 003.

Some of the *DOCKER* enum values may eventually need to be renamed when we add 
those features to runc, but I think it makes sense to wait until those features 
are implemented and change the enum names at that time.

 

> Make container-executor mount logic modular
> ---
>
> Key: YARN-9884
> URL: https://issues.apache.org/jira/browse/YARN-9884
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-9884.001.patch, YARN-9884.002.patch, 
> YARN-9884.003.patch
>
>
> The current mount logic in the container-executor is interwined with docker. 
> To avoid duplicating code between docker and runc, the code should be 
> refactored so that both runtimes can use the same common code when possible.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-25 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044743#comment-17044743
 ] 

Jim Brennan commented on YARN-10161:


[~inigoiri] or [~curino] can you please review?


> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10161.001.patch
>
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 8640ms and NMTokenKeyActivationDelay: 90ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 8640ms and 
> ContainerTokenKeyActivationDelay: 90ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-24 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10161:
--

 Summary: TestRouterWebServicesREST is corrupting STDOUT
 Key: YARN-10161
 URL: https://issues.apache.org/jira/browse/YARN-10161
 Project: Hadoop YARN
  Issue Type: Test
  Components: yarn
Affects Versions: 2.10.0
Reporter: Jim Brennan


TestRouterWebServicesREST is creating processes that inherit stdin/stdout from 
the current process, so the output from those jobs goes into the standard 
output of mvn test.

Here's an example from a recent build:
{noformat}
[WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
1. See FAQ web page and the dump file 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
[INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.644 
s - in org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
[WARNING] ForkStarter IOException: 506 INFO  [main] 
resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
registered UNIX signal handlers for [TERM, HUP, INT]
876 INFO  [main] conf.Configuration 
(Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
found
879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
userToGroupsMap cache
930 INFO  [main] conf.Configuration 
(Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
not found
930 INFO  [main] resource.ResourceUtils 
(ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
'resource-types.xml'.
940 INFO  [main] resource.ResourceUtils 
(ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name = 
memory-mb, units = Mi, type = COUNTABLE
940 INFO  [main] resource.ResourceUtils 
(ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name = 
vcores, units = , type = COUNTABLE
974 INFO  [main] conf.Configuration 
(Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
yarn-site.xml at 
file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
053 INFO  [main] security.NMTokenSecretManagerInRM 
(NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
8640ms and NMTokenKeyActivationDelay: 90ms
060 INFO  [main] security.RMContainerTokenSecretManager 
(RMContainerTokenSecretManager.java:<init>(79)) - 
ContainerTokenKeyRollingInterval: 8640ms and 
ContainerTokenKeyActivationDelay: 90ms
... {noformat}
It seems like these processes should be rerouting stdout/stderr to a file 
instead of dumping it to the console.
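
For example, a minimal sketch of the idiom I have in mind (not the actual test 
code; the command and file names here are placeholders):
{code:java}
import java.io.File;
import java.io.IOException;

public class SpawnWithRedirect {
  public static void main(String[] args) throws IOException, InterruptedException {
    File logDir = new File("target/test-logs");          // hypothetical location
    logDir.mkdirs();

    ProcessBuilder pb = new ProcessBuilder("java", "-version");  // placeholder command
    // Instead of pb.inheritIO(), route the child's stdout/stderr to files so
    // the forked surefire JVM's streams are not corrupted.
    pb.redirectOutput(ProcessBuilder.Redirect.to(new File(logDir, "child.out")));
    pb.redirectError(ProcessBuilder.Redirect.to(new File(logDir, "child.err")));

    int rc = pb.start().waitFor();
    System.out.println("child exited with " + rc);
  }
}
{code}
The output would still be available under the test's target directory for 
debugging, but it would no longer leak into the mvn test stream.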



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2020-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051428#comment-17051428
 ] 

Jim Brennan commented on YARN-2710:
---

Thanks for the patches [~ahussein]!  I have downloaded them and built both 
trunk and branch-2.10.  I am in the process of running all the tests to ensure 
they pass.  One comment/question on the code changes though.  It looks like you 
changed the timeout for all of these tests from 15 secs to 400 secs.  Did it 
really need to be increased that much?


> RM HA tests failed intermittently on trunk
> --
>
> Key: YARN-2710
> URL: https://issues.apache.org/jira/browse/YARN-2710
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: Java 8, jenkins
>Reporter: Wangda Tan
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TestResourceTrackerOnHA-output.2.txt, 
> YARN-2710-branch-2.10.001.patch, YARN-2710.001.patch, 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt
>
>
> Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
> TestResourceTrackerOnHA, etc.
> {code}
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
> testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
>   Time elapsed: 9.491 sec  <<< ERROR!
> java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
> to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
>   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
>   at 
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2020-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051446#comment-17051446
 ] 

Jim Brennan commented on YARN-2710:
---

[~ahussein] can you be more specific? What are you using for retry count and 
retry delay in that calculation? Is it CLIENT_FAILOVER_MAX_ATTEMPTS (10) and 
waittingForFailOver(), which looks like it waits a max of about 5 secs? 
That's a max of 50 secs, unless I am missing something.

When I ran these locally on my mac, most of the tests took only about 25 secs, 
with the exception of tests in TestApplicationClientProtocolOnHA, several of 
which took about 70 secs.
I was thinking 180 secs might be a more reasonable limit.  And you may want to 
use a different value for TestApplicationClientProtocolOnHA vs the others.


> RM HA tests failed intermittently on trunk
> --
>
> Key: YARN-2710
> URL: https://issues.apache.org/jira/browse/YARN-2710
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: Java 8, jenkins
>Reporter: Wangda Tan
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TestResourceTrackerOnHA-output.2.txt, 
> YARN-2710-branch-2.10.001.patch, YARN-2710.001.patch, 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt
>
>
> Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
> TestResourceTrackerOnHA, etc.
> {code}
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
> testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
>   Time elapsed: 9.491 sec  <<< ERROR!
> java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
> to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
>   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
>   at 
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2020-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051460#comment-17051460
 ] 

Jim Brennan commented on YARN-2710:
---

Thanks [~ahussein]!  In the meantime, I have finished running all of these on 
trunk and branch-2.10 with your patch and they all passed.


> RM HA tests failed intermittently on trunk
> --
>
> Key: YARN-2710
> URL: https://issues.apache.org/jira/browse/YARN-2710
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: Java 8, jenkins
>Reporter: Wangda Tan
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TestResourceTrackerOnHA-output.2.txt, 
> YARN-2710-branch-2.10.001.patch, YARN-2710.001.patch, 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt
>
>
> Failure like, it can be happened in TestApplicationClientProtocolOnHA, 
> TestResourceTrackerOnHA, etc.
> {code}
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
> testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
>   Time elapsed: 9.491 sec  <<< ERROR!
> java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
> to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
>   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
>   at 
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2020-03-04 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051579#comment-17051579
 ] 

Jim Brennan commented on YARN-2710:
---

Thanks for the update [~ahussein]!  I re-ran the tests on both trunk and 
branch-2.10.
I am +1 (non-binding) on both patch 002s.
I would definitely like to see this committed, as we are seeing these failures 
intermittently in automated testing for our internal branch-2.10 builds.


> RM HA tests failed intermittently on trunk
> --
>
> Key: YARN-2710
> URL: https://issues.apache.org/jira/browse/YARN-2710
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: Java 8, jenkins
>Reporter: Wangda Tan
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TestResourceTrackerOnHA-output.2.txt, 
> YARN-2710-branch-2.10.001.patch, YARN-2710-branch-2.10.002.patch, 
> YARN-2710.001.patch, YARN-2710.002.patch, 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt
>
>
> Failures like the following can happen in TestApplicationClientProtocolOnHA, 
> TestResourceTrackerOnHA, etc.
> {code}
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
> testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
>   Time elapsed: 9.491 sec  <<< ERROR!
> java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
> to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
>   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
>   at 
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9427) TestContainerSchedulerQueuing.testKillOnlyRequiredOpportunisticContainers fails sporadically

2020-03-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049671#comment-17049671
 ] 

Jim Brennan commented on YARN-9427:
---

Thanks for the patches [~ahussein]!  I am +1 (non-binding) on both.

> TestContainerSchedulerQueuing.testKillOnlyRequiredOpportunisticContainers 
> fails sporadically
> 
>
> Key: YARN-9427
> URL: https://issues.apache.org/jira/browse/YARN-9427
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler, test
>Affects Versions: 2.10.0, 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: 
> TestContainerSchedulerQueuing.testKillOnlyRequiredOpportunisticContainers, 
> YARN-9427-branch-2.10.001.patch, YARN-9427-branch-2.10.002.patch, 
> YARN-9427.001.patch, YARN-9427.002.patch
>
>
> Failed
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testKillOnlyRequiredOpportunisticContainers
> {code}
> java.lang.AssertionError: expected:<2> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testKillOnlyRequiredOpportunisticContainers(TestContainerSchedulerQueuing.java:1027)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-27 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046713#comment-17046713
 ] 

Jim Brennan commented on YARN-10161:


Thanks for the review [~inigoiri]! 
Just to make sure I understand what you are looking for - currently, patch 001 
is creating:
{noformat}
C02V813GHTDD-lm:target jbrennan02$ ls -l test-dir
total 1176
-rw-r--r--  1 jbrennan02  staff  131284 Feb 25 11:36 
TestRouterWebServicesREST-nm.log
-rw-r--r--  1 jbrennan02  staff  324296 Feb 25 11:36 
TestRouterWebServicesREST-rm.log
-rw-r--r--  1 jbrennan02  staff  115423 Feb 25 11:36 
TestRouterWebServicesREST-router.log
{noformat}
I think you are suggesting that I change it so these files are in 
{{test-dir/processes}}, correct?
I will put up a patch with this change.
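
For context, here is a minimal sketch of the kind of redirection being discussed: 
each forked process writes its stdout/stderr to its own log file under the test 
directory instead of inheriting the parent JVM's streams. The class name, command, 
and file name below are illustrative only and are not taken from the patch:
{code:java}
import java.io.File;
import java.io.IOException;

public class ForkWithLogFile {
  public static void main(String[] args)
      throws IOException, InterruptedException {
    // Illustrative log directory; the actual patch may use a different layout.
    File logDir = new File("target/test-dir/processes");
    if (!logDir.isDirectory() && !logDir.mkdirs()) {
      throw new IOException("Could not create " + logDir);
    }
    // Fork a child process and write its output to a file instead of letting
    // it inherit this JVM's stdout/stderr (which is what corrupts the
    // surefire stream).
    ProcessBuilder pb = new ProcessBuilder("java", "-version");
    pb.redirectErrorStream(true); // merge stderr into stdout
    pb.redirectOutput(new File(logDir, "child-process.log"));
    Process child = pb.start();
    int rc = child.waitFor();
    System.out.println("Child exited with " + rc + "; log is in " + logDir);
  }
}
{code}
With each RM/NM/Router process redirected like this, surefire no longer sees their 
output on the forked JVM's stdout, and the per-process logs stay available under 
test-dir for debugging.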



> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10161.001.patch
>
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-27 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10161:
---
Attachment: YARN-10161.002.patch

> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10161.001.patch, YARN-10161.002.patch
>
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-27 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046794#comment-17046794
 ] 

Jim Brennan commented on YARN-10161:


Patch 003 fixes the whitespace issue.


> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10161.001.patch, YARN-10161.002.patch, 
> YARN-10161.003.patch
>
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-27 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10161:
---
Attachment: YARN-10161.003.patch

> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10161.001.patch, YARN-10161.002.patch, 
> YARN-10161.003.patch
>
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-25 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-10161:
--

Assignee: Jim Brennan

> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-25 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-10161:
---
Affects Version/s: 3.2.1

> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-27 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046970#comment-17046970
 ] 

Jim Brennan commented on YARN-10161:


Thanks [~inigoiri]!  Can you commit this to trunk and other branches?  I've 
verified that the patch applies cleanly to branch-2.10, which is where I 
noticed the problem.


> TestRouterWebServicesREST is corrupting STDOUT
> --
>
> Key: YARN-10161
> URL: https://issues.apache.org/jira/browse/YARN-10161
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: yarn
>Affects Versions: 2.10.0, 3.2.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Minor
> Attachments: YARN-10161.001.patch, YARN-10161.002.patch, 
> YARN-10161.003.patch
>
>
> TestRouterWebServicesREST is creating processes that inherit stdin/stdout 
> from the current process, so the output from those jobs goes into the 
> standard output of mvn test.
> Here's an example from a recent build:
> {noformat}
> [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
> 1. See FAQ web page and the dump file 
> /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
> [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 41.644 s - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [WARNING] ForkStarter IOException: 506 INFO  [main] 
> resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
> 522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
> registered UNIX signal handlers for [TERM, HUP, INT]
> 876 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
> found
> 879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
> userToGroupsMap cache
> 930 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
> not found
> 930 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
> 'resource-types.xml'.
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name 
> = memory-mb, units = Mi, type = COUNTABLE
> 940 INFO  [main] resource.ResourceUtils 
> (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name 
> = vcores, units = , type = COUNTABLE
> 974 INFO  [main] conf.Configuration 
> (Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
> yarn-site.xml at 
> file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
> 001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 053 INFO  [main] security.NMTokenSecretManagerInRM 
> (NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
> 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 060 INFO  [main] security.RMContainerTokenSecretManager 
> (RMContainerTokenSecretManager.java:<init>(79)) - 
> ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> ... {noformat}
> It seems like these processes should be rerouting stdout/stderr to a file 
> instead of dumping it to the console.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk

2020-03-06 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053506#comment-17053506
 ] 

Jim Brennan commented on YARN-2710:
---

[~kihwal] it would be good to get this committed.  We are still seeing these 
intermittent failures in our internal builds.


> RM HA tests failed intermittently on trunk
> --
>
> Key: YARN-2710
> URL: https://issues.apache.org/jira/browse/YARN-2710
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
> Environment: Java 8, jenkins
>Reporter: Wangda Tan
>Assignee: Ahmed Hussein
>Priority: Major
> Attachments: TestResourceTrackerOnHA-output.2.txt, 
> YARN-2710-branch-2.10.001.patch, YARN-2710-branch-2.10.002.patch, 
> YARN-2710.001.patch, YARN-2710.002.patch, 
> org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt
>
>
> Failures like the following can happen in TestApplicationClientProtocolOnHA, 
> TestResourceTrackerOnHA, etc.
> {code}
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA
> testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA)
>   Time elapsed: 9.491 sec  <<< ERROR!
> java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 
> to asf905.gq1.ygridcore.net:28032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
>   at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
>   at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1438)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>   at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
>   at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583)
>   at 
> org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


