[jira] [Commented] (YARN-7678) Ability to enable logging of container memory stats

2018-01-03 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310302#comment-16310302
 ] 

Jim Brennan commented on YARN-7678:
---

Will do.  Thanks!

> Ability to enable logging of container memory stats
> ---
>
> Key: YARN-7678
> URL: https://issues.apache.org/jira/browse/YARN-7678
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0, 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
> Attachments: YARN-7678.001.patch
>
>
> YARN-3424 changed the logging of memory stats in ContainersMonitorImpl from 
> INFO to DEBUG.
> We have found these log messages to be useful information in Out-of-Memory 
> situations - they provide detail that helps show the memory profile of the 
> container over time, which can be helpful in determining root cause.
> Here's an example message from YARN-3424:
> {noformat}
> 2015-03-27 09:32:48,905 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Memory usage of ProcessTree 9215 for container-id 
> container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory 
> used; 2.6 GB of 2.1 GB virtual memory used
> {noformat}
> Propose to change this to use a separate logger for this message, so that we 
> can enable debug logging for this without enabling all of the other debug 
> logging for ContainersMonitorImpl.
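
As a rough illustration of the intended usage, operators could then enable just 
the memory-stats logger in log4j.properties. The {{.audit}} logger name below is 
an assumption, based on the {{ContainersMonitorImpl.audit}} lines quoted in the 
YARN-8444 messages later in this digest:
{noformat}
# Leave ContainersMonitorImpl at INFO, but enable the per-container memory line.
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl=INFO
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl.audit=DEBUG
{noformat}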






[jira] [Updated] (YARN-7678) Ability to enable logging of container memory stats

2018-01-03 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-7678:
--
Attachment: YARN-7678-branch-2.001.patch

Providing a patch for branch-2.
I repeated the manual tests described above for this patch.


> Ability to enable logging of container memory stats
> ---
>
> Key: YARN-7678
> URL: https://issues.apache.org/jira/browse/YARN-7678
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0, 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
> Attachments: YARN-7678-branch-2.001.patch, YARN-7678.001.patch
>
>
> YARN-3424 changed the logging of memory stats in ContainersMonitorImpl from 
> INFO to DEBUG.
> We have found these log messages to be useful information in Out-of-Memory 
> situations - they provide detail that helps show the memory profile of the 
> container over time, which can be helpful in determining root cause.
> Here's an example message from YARN-3424:
> {noformat}
> 2015-03-27 09:32:48,905 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Memory usage of ProcessTree 9215 for container-id 
> container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory 
> used; 2.6 GB of 2.1 GB virtual memory used
> {noformat}
> Propose to change this to use a separate logger for this message, so that we 
> can enable debug logging for this without enabling all of the other debug 
> logging for ContainersMonitorImpl.






[jira] [Commented] (YARN-7678) Ability to enable logging of container memory stats

2018-01-04 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311491#comment-16311491
 ] 

Jim Brennan commented on YARN-7678:
---

The unit test failure in TestContainerSchedulerQueuing is unrelated to this 
change.  I reran that test locally and it still succeeds for me.
I described the testing I've done in an earlier comment.

I think the branch-2 patch is ready for review.


> Ability to enable logging of container memory stats
> ---
>
> Key: YARN-7678
> URL: https://issues.apache.org/jira/browse/YARN-7678
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.8.0, 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
> Attachments: YARN-7678-branch-2.001.patch, YARN-7678.001.patch
>
>
> YARN-3424 changed the logging of memory stats in ContainersMonitorImpl from 
> INFO to DEBUG.
> We have found these log messages to be useful information in Out-of-Memory 
> situations - they provide detail that helps show the memory profile of the 
> container over time, which can be helpful in determining root cause.
> Here's an example message from YARN-3424:
> {noformat}
> 2015-03-27 09:32:48,905 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Memory usage of ProcessTree 9215 for container-id 
> container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory 
> used; 2.6 GB of 2.1 GB virtual memory used
> {noformat}
> Propose to change this to use a separate logger for this message, so that we 
> can enable debug logging for this without enabling all of the other debug 
> logging for ContainersMonitorImpl.






[jira] [Commented] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value

2018-06-20 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518171#comment-16518171
 ] 

Jim Brennan commented on YARN-8444:
---

The bad value came from /proc/meminfo - it looks like the kernel reported a 
negative value as an unsigned 64-bit decimal, which is too big to parse as a 
signed long.
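
For reference, a small self-contained sketch (not the actual patch) of why the 
parse fails, plus one defensive option; the clamp-to-zero fallback is an 
assumption for illustration, not necessarily what SysInfoLinux should do:
{code}
public class SwapFreeParseDemo {
  public static void main(String[] args) {
    // 18446744073709551596 is -20 reinterpreted as an unsigned 64-bit value.
    String swapFree = "18446744073709551596";
    try {
      Long.parseLong(swapFree);                    // throws NumberFormatException
    } catch (NumberFormatException e) {
      // One defensive option: re-parse as unsigned and clamp negatives to zero.
      long raw = Long.parseUnsignedLong(swapFree); // yields -20 as a signed long
      long safeSwapFree = Math.max(raw, 0L);
      System.out.println("clamped swapFree = " + safeSwapFree);
    }
  }
}
{code}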

> NodeResourceMonitor crashes on bad swapFree value
> -
>
> Key: YARN-8444
> URL: https://issues.apache.org/jira/browse/YARN-8444
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> Saw this on a node that was having difficulty preempting containers. Can't 
> have NodeResourceMonitor exiting. System was above 99% memory used at the 
> time, so it may only happen when normal preemption isn't working right, but 
> we should fix it since this is a critical monitor for the health of the node.
>  
> {noformat}
> 2018-06-04 14:28:08,539 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for 
> container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB 
> physical memory used; 5.0 GB of 7.3 GB virtual memory used
> 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
> yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource 
> Monitor,5,main] threw an Exception.
> java.lang.NumberFormatException: For input string: "18446744073709551596"
>  at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>  at java.lang.Long.parseLong(Long.java:592)
>  at java.lang.Long.parseLong(Long.java:631)
>  at 
> org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
>  at 
> org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
>  at 
> org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
>  at 
> org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
> 2018-06-04 14:28:30,747 
> [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO 
> util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of 
> approximately 9330ms
> {noformat}






[jira] [Updated] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value

2018-06-20 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8444:
--
Description: 
Saw this on a node that was running out of memory. Can't have 
NodeResourceMonitor exiting. System was above 99% memory used at the time, so 
this is not a common occurrence, but we should fix it since this is a critical 
monitor for the health of the node.

 
{noformat}
2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 110564 for container-id 
container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory 
used; 5.0 GB of 7.3 GB virtual memory used
2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] 
threw an Exception.
java.lang.NumberFormatException: For input string: "18446744073709551596"
 at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:592)
 at java.lang.Long.parseLong(Long.java:631)
 at 
org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
 at 
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
 at 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
2018-06-04 14:28:30,747 
[org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO 
util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of 
approximately 9330ms
{noformat}

  was:
Saw this on a node that was having difficulty preempting containers. Can't have 
NodeResourceMonitor exiting. System was above 99% memory used at the time, so it 
may only happen when normal preemption isn't working right, but we should fix it 
since this is a critical monitor for the health of the node.

 

{noformat}
2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 110564 for container-id 
container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory 
used; 5.0 GB of 7.3 GB virtual memory used
2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] 
threw an Exception.
java.lang.NumberFormatException: For input string: "18446744073709551596"
 at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:592)
 at java.lang.Long.parseLong(Long.java:631)
 at 
org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
 at 
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
 at 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
2018-06-04 14:28:30,747 
[org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO 
util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of 
approximately 9330ms
{noformat}


> NodeResourceMonitor crashes on bad swapFree value
> -
>
> Key: YARN-8444
> URL: https://issues.apache.org/jira/browse/YARN-8444
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> Saw this on a node that was running out of memory. Can't have 
> NodeResourceMonitor exiting. System was above 99% memory used at the time, so 
> this is not a common occurrence, but we should fix it since this is a critical 
> monitor for the health of the node.
>  
> {noformat}
> 2018-06-04 14:28:08,539 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for 
> container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB 
> physical memory used; 5.0 GB of 7.3 GB virtual memory used
> 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
> yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource 
> Monitor,5,main] threw an Exception.
> java.lang.NumberFormatException: For input string: "18446744073709551596"
>  at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>  at java.lang.Long.parseLong(Long.java:592)
>  at java.lang.Long.parseLong(Long.java:631)
>  at 
> 

[jira] [Created] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value

2018-06-20 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8444:
-

 Summary: NodeResourceMonitor crashes on bad swapFree value
 Key: YARN-8444
 URL: https://issues.apache.org/jira/browse/YARN-8444
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.2, 2.8.3
Reporter: Jim Brennan
Assignee: Jim Brennan


Saw this on a node that was having difficulty preempting containers. Can't have 
NodeResourceMonitor exiting. System was above 99% memory used at the time, so it 
may only happen when normal preemption isn't working right, but we should fix it 
since this is a critical monitor for the health of the node.

 

{noformat}
2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 110564 for container-id 
container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory 
used; 5.0 GB of 7.3 GB virtual memory used
2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] 
threw an Exception.
java.lang.NumberFormatException: For input string: "18446744073709551596"
 at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:592)
 at java.lang.Long.parseLong(Long.java:631)
 at 
org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
 at 
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
 at 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
2018-06-04 14:28:30,747 
[org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO 
util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of 
approximately 9330ms
{noformat}






[jira] [Commented] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value

2018-06-21 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519549#comment-16519549
 ] 

Jim Brennan commented on YARN-8444:
---

[~eepayne], can you please review?

 

> NodeResourceMonitor crashes on bad swapFree value
> -
>
> Key: YARN-8444
> URL: https://issues.apache.org/jira/browse/YARN-8444
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8444.001.patch
>
>
> Saw this on a node that was running out of memory. Can't have 
> NodeResourceMonitor exiting. System was above 99% memory used at the time, so 
> this is not a common occurrence, but we should fix it since this is a critical 
> monitor for the health of the node.
>  
> {noformat}
> 2018-06-04 14:28:08,539 [Container Monitor] DEBUG 
> ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for 
> container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB 
> physical memory used; 5.0 GB of 7.3 GB virtual memory used
> 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
> yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource 
> Monitor,5,main] threw an Exception.
> java.lang.NumberFormatException: For input string: "18446744073709551596"
>  at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>  at java.lang.Long.parseLong(Long.java:592)
>  at java.lang.Long.parseLong(Long.java:631)
>  at 
> org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
>  at 
> org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
>  at 
> org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
>  at 
> org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
>  at 
> org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
> 2018-06-04 14:28:30,747 
> [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO 
> util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of 
> approximately 9330ms
> {noformat}






[jira] [Updated] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails

2018-08-10 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8640:
--
Attachment: YARN-8640.001.patch

> Restore previous state in container-executor if write_exit_code_file_as_nm 
> fails
> 
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.






[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576916#comment-16576916
 ] 

Jim Brennan commented on YARN-8648:
---

One proposal to fix the leaking cgroups is to have docker put its containers 
directly under the 
{{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}} directory. For 
example, instead of using {{cgroup-parent=/hadoop-yarn/container_id}}, we use 
{{cgroup-parent=/hadoop-yarn}}. This does cause docker to create a 
{{hadoop-yarn}} cgroup under each resource type, and it does not clean those 
up, but that is just one unused cgroup per resource type versus hundreds of 
thousands of leaked per-container cgroups.
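
To illustrate, using the paths from the issue description (the docker container 
id is a placeholder), the cpuset hierarchy in the two cases would look roughly 
like this:
{noformat}
# cgroup-parent=/hadoop-yarn/container_id (current behavior)
/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id   <- removed by docker
/sys/fs/cgroup/cpuset/hadoop-yarn/container_id                       <- leaked, one per container

# cgroup-parent=/hadoop-yarn (proposed)
/sys/fs/cgroup/cpuset/hadoop-yarn/docker_container_id                <- removed by docker
/sys/fs/cgroup/cpuset/hadoop-yarn                                    <- one leftover per resource type
{noformat}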

This can be done by just passing an empty string to 
DockerLinuxContainerRuntime.addCGroupParentIfRequired(), or otherwise changing 
it to ignore the containerIdStr. Doing this and removing the code that 
cherry-picks the PID in container-executor does work, but the NM still creates 
the per-container cgroups as well - they're just not used. The other issue with 
this approach is that cpu.shares is still written (to reflect the requested 
vcores allotment) to the now-unused per-container cgroup, so that setting is 
effectively ignored. In our code, we addressed this by passing the cpu.shares 
value via the docker run --cpu-shares command line argument.

I'm still thinking about the best way to address this. Currently most of the 
resourceHandler processing happens at the linuxContainerExecutor level. But 
there is clearly a difference in how cgroups need to be handled for docker vs 
linux cases. In the docker case, we should arguably use docker command line 
arguments instead of directly setting up cgroups.

One option would be to provide a runtime interface useResourceHandlers() which 
for Docker would return false. We could then disable all of the resource 
handling processing that happens in the container executor, and add the 
necessary interfaces to handle cgroup parameters to the docker runtime.

Another option would be to move the resource handler processing down into the 
runtime. This is a bigger change, but may be cleaner. The docker runtime may 
still just ignore those handlers, but that detail would be hidden at the 
container executor level.

cc: [~ebadger] [~jlowe] [~eyang] [~shaneku...@gmail.com] [~billie.rinaldi]

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.






[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker

2018-08-10 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8648:
--
Labels: Docker  (was: )

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.






[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576855#comment-16576855
 ] 

Jim Brennan commented on YARN-8648:
---

Another problem we have seen is that container-executor still has code that 
cherry-picks the PID of the launch shell from the docker container and writes 
that into the {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/tasks}} file, 
effectively moving it from 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}} to 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id}}.   So you end up with one 
process out of the container in the {{container_id}} cgroup, and the rest in 
the {{container_id/docker_container_id}} cgroup.



> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.






[jira] [Commented] (YARN-6495) check docker container's exit code when writing to cgroup task files

2018-08-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576925#comment-16576925
 ] 

Jim Brennan commented on YARN-6495:
---

As part of YARN-8648, I am proposing that we can just remove the code that this 
patch is fixing.  If we are using cgroups, we are passing the {{cgroup-parent}} 
argument to docker, which accomplishes what this code was trying to do in a 
much more deterministic and reliable way.

My proposal would be to remove this code as part of YARN-8648, but if there is 
a preference for doing that in a separate Jira, I can file a new one.  Assuming 
there is agreement, I think we can close out this Jira.

[~Jaeboo], [~ebadger], do you agree?

> check docker container's exit code when writing to cgroup task files
> 
>
> Key: YARN-6495
> URL: https://issues.apache.org/jira/browse/YARN-6495
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Jaeboo Jeong
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-6495.001.patch, YARN-6495.002.patch
>
>
> If I execute simple command like date on docker container, the application 
> failed to complete successfully.
> for example, 
> {code}
> $ yarn  jar 
> $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar
>  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker -shell_command "date" -jar 
> $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar
>  -num_containers 1 -timeout 360
> …
> 17/04/12 00:16:40 INFO distributedshell.Client: Application did finished 
> unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring 
> loop
> 17/04/12 00:16:40 ERROR distributedshell.Client: Application failed to 
> complete successfully
> {code}
> The error log is like below.
> {code}
> ...
> Failed to write pid to file 
> /cgroup_parent/cpu/hadoop-yarn/container_/tasks - No such process
> ...
> {code}
> When writing pid to cgroup tasks, container-executor doesn’t check docker 
> container’s status.
> If the container finished very quickly, we can’t write pid to cgroup tasks, 
> and it is not problem.
> So container-executor needs to check docker container’s exit code during 
> writing pid to cgroup tasks.






[jira] [Created] (YARN-8648) Container cgroups are leaked when using docker

2018-08-10 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8648:
-

 Summary: Container cgroups are leaked when using docker
 Key: YARN-8648
 URL: https://issues.apache.org/jira/browse/YARN-8648
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan


When you run with docker and enable cgroups for cpu, docker creates cgroups for 
all resources on the system, not just for cpu.  For instance, if the 
{{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
the nodemanager will create a cgroup for each container under 
{{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path via 
the {{--cgroup-parent}} command line argument.   Docker then creates a cgroup 
for the docker container under that, for instance: 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.

When the container exits, docker cleans up the {{docker_container_id}} cgroup, 
and the nodemanager cleans up the {{container_id}} cgroup.  All is good under 
{{/sys/fs/cgroup/hadoop-yarn}}.

The problem is that docker also creates that same hierarchy under every 
resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these are: 
blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
perf_event, and systemd.  So for instance, docker creates 
{{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but it 
only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up the 
{{container_id}} cgroups for these other resources.  On one of our busy 
clusters, we found > 100,000 of these leaked cgroups.

I found this in our 2.8-based version of hadoop, but I have been able to repro 
with current hadoop.







[jira] [Created] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails

2018-08-09 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8640:
-

 Summary: Restore previous state in container-executor if 
write_exit_code_file_as_nm fails
 Key: YARN-8640
 URL: https://issues.apache.org/jira/browse/YARN-8640
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan


The container-executor function {{write_exit_code_file_as_nm}} had a number of 
failure conditions where it just returns -1 without restoring previous state.
This is not a problem in any of the places where it is currently called, but it 
could be a problem if future code changes call it before code that depends on 
the previous state.
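
As an illustration of the pattern the fix applies - the real function is C code 
in container-executor, and the specific saved state shown here (the effective 
user) is an assumption for the sketch - every error path should put the saved 
state back before returning:
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RestoreOnFailureSketch {
  private String effectiveUser = "root";      // stand-in for process-wide state

  int writeExitCodeFileAsNm(Path file, int exitCode, String nmUser) {
    String savedUser = effectiveUser;         // remember the previous state
    effectiveUser = nmUser;                   // switch state to perform the write
    try {
      Files.write(file, Integer.toString(exitCode).getBytes());
      return 0;
    } catch (IOException e) {
      return -1;                              // failure path...
    } finally {
      effectiveUser = savedUser;              // ...still restores the prior state
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("exitcode", ".txt");
    new RestoreOnFailureSketch().writeExitCodeFileAsNm(tmp, 143, "nm");
  }
}
{code}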







[jira] [Comment Edited] (YARN-8648) Container cgroups are leaked when using docker

2018-08-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576855#comment-16576855
 ] 

Jim Brennan edited comment on YARN-8648 at 8/10/18 9:37 PM:


Another problem we have seen is that container-executor still has code that 
cherry-picks the PID of the launch shell from the docker container and writes 
that into the {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/tasks}} file, 
effectively moving it from 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}} to 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id}}.   So you end up with one 
process out of the container in the {{container_id}} cgroup, and the rest in 
the {{container_id/docker_container_id}} cgroup.

Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
manually write the pid - we can just remove the code that does this.  


was (Author: jim_brennan):
Another problem we have seen is that container-executor still has code that 
cherry-picks the PID of the launch shell from the docker container and writes 
that into the {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/tasks}} file, 
effectively moving it from 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}} to 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id}}.   So you end up with one 
process out of the container in the {{container_id}} cgroup, and the rest in 
the {{container_id/docker_container_id}} cgroup.



> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.






[jira] [Created] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-13 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8656:
-

 Summary: container-executor should not write cgroup tasks files 
for docker containers
 Key: YARN-8656
 URL: https://issues.apache.org/jira/browse/YARN-8656
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan


If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
run}} to ensure that all processes for the container are placed into a cgroup 
under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
Docker creates a cgroup there with the docker container id as the name and all 
of the processes in the container go into that cgroup.

container-executor has code in {{launch_docker_container_as_user()}} that then 
cherry-picks the PID of the docker container (usually the launch shell) and 
writes that into the 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
moving it from 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with one 
process out of the container in the {{container_id}} cgroup, and the rest in 
the {{container_id/docker_container_id}} cgroup.

Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
manually write the container pid to the tasks file - we can just remove the 
code that does this in the docker case.






[jira] [Commented] (YARN-6495) check docker container's exit code when writing to cgroup task files

2018-08-13 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578844#comment-16578844
 ] 

Jim Brennan commented on YARN-6495:
---

{quote}My proposal would be to remove this code as part of YARN-8648, but if 
there is a preference for doing that in a separate Jira, I can file a new one.  
Assuming there is agreement, I think we can close out this Jira.
{quote}
I decided to file a new Jira: YARN-8656 for this, rather than lumping it in 
with YARN-8648.

> check docker container's exit code when writing to cgroup task files
> 
>
> Key: YARN-6495
> URL: https://issues.apache.org/jira/browse/YARN-6495
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Jaeboo Jeong
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-6495.001.patch, YARN-6495.002.patch
>
>
> If I execute simple command like date on docker container, the application 
> failed to complete successfully.
> for example, 
> {code}
> $ yarn  jar 
> $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar
>  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker -shell_command "date" -jar 
> $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar
>  -num_containers 1 -timeout 360
> …
> 17/04/12 00:16:40 INFO distributedshell.Client: Application did finished 
> unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring 
> loop
> 17/04/12 00:16:40 ERROR distributedshell.Client: Application failed to 
> complete successfully
> {code}
> The error log is like below.
> {code}
> ...
> Failed to write pid to file 
> /cgroup_parent/cpu/hadoop-yarn/container_/tasks - No such process
> ...
> {code}
> When writing pid to cgroup tasks, container-executor doesn’t check docker 
> container’s status.
> If the container finished very quickly, we can’t write pid to cgroup tasks, 
> and it is not problem.
> So container-executor needs to check docker container’s exit code during 
> writing pid to cgroup tasks.






[jira] [Assigned] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-13 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-8656:
-

Assignee: Jim Brennan

> container-executor should not write cgroup tasks files for docker containers
> 
>
> Key: YARN-8656
> URL: https://issues.apache.org/jira/browse/YARN-8656
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
> run}} to ensure that all processes for the container are placed into a cgroup 
> under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
> Docker creates a cgroup there with the docker container id as the name and 
> all of the processes in the container go into that cgroup.
> container-executor has code in {{launch_docker_container_as_user()}} that 
> then cherry-picks the PID of the docker container (usually the launch shell) 
> and writes that into the 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
> moving it from 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with 
> one process out of the container in the {{container_id}} cgroup, and the rest 
> in the {{container_id/docker_container_id}} cgroup.
> Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
> manually write the container pid to the tasks file - we can just remove the 
> code that does this in the docker case.






[jira] [Commented] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-15 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581110#comment-16581110
 ] 

Jim Brennan commented on YARN-8656:
---

I am unable to repro the unit test failure in 
TestContainerManager#testLocalingResourceWhileContainerRunning.   I don't think 
it is related to my change.

> container-executor should not write cgroup tasks files for docker containers
> 
>
> Key: YARN-8656
> URL: https://issues.apache.org/jira/browse/YARN-8656
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8656.001.patch, YARN-8656.002.patch
>
>
> If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
> run}} to ensure that all processes for the container are placed into a cgroup 
> under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
> Docker creates a cgroup there with the docker container id as the name and 
> all of the processes in the container go into that cgroup.
> container-executor has code in {{launch_docker_container_as_user()}} that 
> then cherry-picks the PID of the docker container (usually the launch shell) 
> and writes that into the 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
> moving it from 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with 
> one process out of the container in the {{container_id}} cgroup, and the rest 
> in the {{container_id/docker_container_id}} cgroup.
> Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
> manually write the container pid to the tasks file - we can just remove the 
> code that does this in the docker case.






[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-14 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580067#comment-16580067
 ] 

Jim Brennan commented on YARN-8648:
---

[~jlowe] thanks for the comment.
{quote}We should consider breaking this up into two JIRAs if it proves 
difficult to hash through the design. It's a relatively small change to move 
the docker containers under the top-level YARN cgroup hierarchy to fix the 
cgroup leaks, with the side-effect that the NM continues to create and cleanup 
unused cgroups per docker container launched. We could follow up that change 
with another JIRA to resolve the new design for the cgroup / container runtime 
interaction so those empty cgroups are avoided in the docker case. If we can 
hash it out quickly in one JIRA that's great, but I want to make sure the leak 
problem doesn't linger while we work through the architecture of cgroups and 
container runtimes.
 {quote}

The main issue with doing this quick fix for the cgroups leak is that any 
cgroup parameters written by the various resource handlers will be ignored in 
the docker case because they will be written to the unused container cgroup.  
Internally, we added a cpu-shares option to the docker run command to handle the 
cpu resource because that is the only one we're using, but for the community I 
think we need to address them all.
Is it worth temporarily breaking cgroup parameters for docker to fix the leak?


> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.






[jira] [Reopened] (YARN-8640) Restore previous state in container-executor after failure

2018-08-14 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reopened YARN-8640:
---

Reopening so I can provide patches for branch-2.7 and branch-2.8.

> Restore previous state in container-executor after failure
> --
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.






[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-14 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580234#comment-16580234
 ] 

Jim Brennan commented on YARN-8648:
---

{quote}I am wondering if this approach would break the docker launching docker 
use case. Currently if you're launching a new docker container from an existing 
docker container, you can have the new container use the same cgroup as the 
first container (e.g. /hadoop-yarn/${CONTAINER_ID}), but if there weren't a 
unique cgroup parent for the container you wouldn't be able to do that. Unless 
there's a way to find out the docker container id from inside the container?
{quote}
 

Thanks [~billie.rinaldi]. Yes, I think this use case would break as you suggest.
{quote}One potential issue with a useResourceHandlers() approach is if the NM 
wants to manipulate cgroup settings on a live container. Having a runtime that 
says it doesn't use resource handlers implies that can't be done by that 
runtime, but it can be supported by the docker runtime
{quote}
Agreed [~jlowe].  I no longer think useResourceHandlers() is a good approach.

I don't have a full solution in mind yet, but one question is whether we should 
continue using the per-container cgroup as the cgroup parent for docker.  The 
main advantage to maintaining it is that there is a lot of code that already 
depends on it.  All existing resource handlers just work with this setup.  The 
disadvantage is that it makes fixing the leak harder because docker is creating 
hierarchies under the unused resource types (cpuset, hugetlb, etc.), and it 
creates them as root, making it harder for the NM to remove them.

If we use the top-level (hadoop-yarn) as the cgroup parent, then docker cleans 
everything up pretty nicely (although it still leaks the top-level hadoop-yarn 
cgroup for the unused-by-NM resources).  But it breaks the case 
[~billie.rinaldi] mentioned above, and requires that we convert all existing 
resource handlers to use docker command options in the docker case.

One thought I had is adding a dockerCleanupResourceHandler that we tack on to 
the end of the resourceHandlerChain - its only job would be to clean up the 
extra container cgroups that docker creates.
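
A minimal sketch of what that cleanup-only step could do, under stated 
assumptions: it does not implement the real ResourceHandler interface, the 
controller list comes from the issue description, and since docker creates these 
directories as root the NM may in practice lack permission to remove them:
{code}
import java.io.File;
import java.util.Arrays;
import java.util.List;

/** Hypothetical cleanup step: remove the empty per-container cgroups that
 *  docker leaves behind under controllers the NM does not manage itself. */
public class DockerCgroupCleanupSketch {
  private static final List<String> CONTROLLERS = Arrays.asList(
      "blkio", "cpuset", "devices", "freezer", "hugetlb", "memory",
      "net_cls", "net_prio", "perf_event", "systemd");

  public static void cleanup(String hierarchy, String containerId) {
    for (String controller : CONTROLLERS) {
      File cgroup = new File(
          "/sys/fs/cgroup/" + controller + hierarchy + "/" + containerId);
      // File.delete() on a directory only succeeds when it is empty, matching
      // cgroup rmdir semantics, so a still-populated cgroup is left alone.
      if (cgroup.isDirectory() && !cgroup.delete()) {
        System.err.println("Could not remove leaked cgroup " + cgroup);
      }
    }
  }

  public static void main(String[] args) {
    cleanup("/hadoop-yarn", "container_e24_1526662705797_129647_01_004791");
  }
}
{code}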

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.






[jira] [Updated] (YARN-8640) Restore previous state in container-executor after failure

2018-08-14 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8640:
--
Attachment: YARN-8640-branch-2.8.001.patch
YARN-8640-branch-2.7.001.patch

> Restore previous state in container-executor after failure
> --
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: YARN-8640-branch-2.7.001.patch, 
> YARN-8640-branch-2.8.001.patch, YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.






[jira] [Updated] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-14 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8656:
--
Attachment: YARN-8656.002.patch

> container-executor should not write cgroup tasks files for docker containers
> 
>
> Key: YARN-8656
> URL: https://issues.apache.org/jira/browse/YARN-8656
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8656.001.patch, YARN-8656.002.patch
>
>
> If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
> run}} to ensure that all processes for the container are placed into a cgroup 
> under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
> Docker creates a cgroup there with the docker container id as the name and 
> all of the processes in the container go into that cgroup.
> container-executor has code in {{launch_docker_container_as_user()}} that 
> then cherry-picks the PID of the docker container (usually the launch shell) 
> and writes that into the 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
> moving it from 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with 
> one process out of the container in the {{container_id}} cgroup, and the rest 
> in the {{container_id/docker_container_id}} cgroup.
> Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
> manually write the container pid to the tasks file - we can just remove the 
> code that does this in the docker case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8640) Restore previous state in container-executor after failure

2018-08-16 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583053#comment-16583053
 ] 

Jim Brennan commented on YARN-8640:
---

[~jlowe]  I'm not sure what happened here with genericqa?

 

> Restore previous state in container-executor after failure
> --
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: YARN-8640-branch-2.7.001.patch, 
> YARN-8640-branch-2.8.001.patch, YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails

2018-08-13 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578547#comment-16578547
 ] 

Jim Brennan commented on YARN-8640:
---

Tested by running test-container-executor and cetest - both pass with no 
errors.  Also ran sleep and pi jobs on a single node cluster with and without 
docker, and also with NM restart during the jobs.

[~jlowe] please review.

 

> Restore previous state in container-executor if write_exit_code_file_as_nm 
> fails
> 
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-14 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8656:
--
Attachment: YARN-8656.001.patch

> container-executor should not write cgroup tasks files for docker containers
> 
>
> Key: YARN-8656
> URL: https://issues.apache.org/jira/browse/YARN-8656
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8656.001.patch
>
>
> If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
> run}} to ensure that all processes for the container are placed into a cgroup 
> under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
> Docker creates a cgroup there with the docker container id as the name and 
> all of the processes in the container go into that cgroup.
> container-executor has code in {{launch_docker_container_as_user()}} that 
> then cherry-picks the PID of the docker container (usually the launch shell) 
> and writes that into the 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
> moving it from 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with 
> one process out of the container in the {{container_id}} cgroup, and the rest 
> in the {{container_id/docker_container_id}} cgroup.
> Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
> manually write the container pid to the tasks file - we can just remove the 
> code that does this in the docker case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-14 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579978#comment-16579978
 ] 

Jim Brennan commented on YARN-8656:
---

I have tested this by running test-container-executor, cetest, and nodemanager 
unit tests.  I've also run some jobs on a single node cluster and manually 
verified that with Docker the single PID is no longer written to the 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file. All PIDs for 
the container appear in the 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id/tasks}} 
file, which is managed by docker.

I do have one question: should I remove the resources_values argument from 
{{launch_docker_container_as_user()}}, since it is no longer used? Could also 
remove it from DockerLinuxContainerRuntime.buildLaunchOp().

[~jlowe], [~ebadger], [~eyang], [~shaneku...@gmail.com] - thoughts?

 

> container-executor should not write cgroup tasks files for docker containers
> 
>
> Key: YARN-8656
> URL: https://issues.apache.org/jira/browse/YARN-8656
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8656.001.patch
>
>
> If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
> run}} to ensure that all processes for the container are placed into a cgroup 
> under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
> Docker creates a cgroup there with the docker container id as the name and 
> all of the processes in the container go into that cgroup.
> container-executor has code in {{launch_docker_container_as_user()}} that 
> then cherry-picks the PID of the docker container (usually the launch shell) 
> and writes that into the 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
> moving it from 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
> {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with 
> one process out of the container in the {{container_id}} cgroup, and the rest 
> in the {{container_id/docker_container_id}} cgroup.
> Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
> manually write the container pid to the tasks file - we can just remove the 
> code that does this in the docker case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-20 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586060#comment-16586060
 ] 

Jim Brennan commented on YARN-8648:
---

Thanks [~eyang]!  My main concern about the minimal fix is the security aspect, 
since we will need to add an option to container-executor to tell it to delete 
all cgroups with a particular name as root (since docker will create them as 
root).

I think this is mitigated if we use the "cgroup" section of 
container-executor.cfg to constrain it.  This is currently used to enable 
updating params, but I think it could be used for this as well.  It already 
defines the CGROUPS_ROOT (e.g., /sys/fs/cgroup) and the YARN_HIERARCHY (e.g., 
hadoop-yarn).  We could either add another config parameter to define the list 
of hierarchies to clean up (e.g., cpuset, freezer, hugetlb, etc.), or we could 
parse /proc/mounts to determine the full list.  I think it's safer to add the 
config parameter.

I will start working on this version unless there are objections?
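
For illustration only, here is a rough sketch of what that could look like in 
the cgroups section of container-executor.cfg.  The existing key names may 
differ slightly from what is shown, and the cleanup-controllers key is made up 
for this example:

{noformat}
[cgroups]
# existing settings backing CGROUPS_ROOT and YARN_HIERARCHY
root=/sys/fs/cgroup
yarn-hierarchy=hadoop-yarn
# hypothetical new key: the only controllers whose per-container cgroups
# container-executor would be allowed to remove
cleanup-controllers=cpuset,freezer,hugetlb,net_cls,net_prio,perf_event,systemd
{noformat}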

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8640) Restore previous state in container-executor after failure

2018-08-17 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8640:
--
Attachment: YARN-8640-branch-2.8.002.patch
YARN-8640-branch-2.7.002.patch

> Restore previous state in container-executor after failure
> --
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: YARN-8640-branch-2.7.001.patch, 
> YARN-8640-branch-2.7.002.patch, YARN-8640-branch-2.8.001.patch, 
> YARN-8640-branch-2.8.002.patch, YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6495) check docker container's exit code when writing to cgroup task files

2018-08-17 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584038#comment-16584038
 ] 

Jim Brennan commented on YARN-6495:
---

YARN-8656 removed the code that this Jira was fixing.  I think we can close 
this one now.

[~Jaeboo], [~ebadger], any objections?


> check docker container's exit code when writing to cgroup task files
> 
>
> Key: YARN-6495
> URL: https://issues.apache.org/jira/browse/YARN-6495
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Jaeboo Jeong
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-6495.001.patch, YARN-6495.002.patch
>
>
> If I execute simple command like date on docker container, the application 
> failed to complete successfully.
> for example, 
> {code}
> $ yarn  jar 
> $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar
>  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker -shell_command "date" -jar 
> $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar
>  -num_containers 1 -timeout 360
> …
> 17/04/12 00:16:40 INFO distributedshell.Client: Application did finished 
> unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring 
> loop
> 17/04/12 00:16:40 ERROR distributedshell.Client: Application failed to 
> complete successfully
> {code}
> The error log is like below.
> {code}
> ...
> Failed to write pid to file 
> /cgroup_parent/cpu/hadoop-yarn/container_/tasks - No such process
> ...
> {code}
> When writing pid to cgroup tasks, container-executor doesn’t check docker 
> container’s status.
> If the container finished very quickly, we can’t write pid to cgroup tasks, 
> and it is not problem.
> So container-executor needs to check docker container’s exit code during 
> writing pid to cgroup tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8640) Restore previous state in container-executor after failure

2018-08-17 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584024#comment-16584024
 ] 

Jim Brennan commented on YARN-8640:
---

[~jlowe], thanks for the review!  I have removed the changes to 
write_exit_code_file() in both patches.

 

> Restore previous state in container-executor after failure
> --
>
> Key: YARN-8640
> URL: https://issues.apache.org/jira/browse/YARN-8640
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
>
> Attachments: YARN-8640-branch-2.7.001.patch, 
> YARN-8640-branch-2.8.001.patch, YARN-8640.001.patch
>
>
> The container-executor function {{write_exit_code_file_as_nm}} had a number 
> of failure conditions where it just returns -1 without restoring previous 
> state.
> This is not a problem in any of the places where it is currently called, but 
> it could be a problem if future code changes call it before code that depends 
> on the previous state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8675) Setting hostname of docker container breaks with "host" networking mode for Apps which do not run as a YARN service

2018-08-24 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592244#comment-16592244
 ] 

Jim Brennan commented on YARN-8675:
---

[~suma.shivaprasad] thanks for updating.  Patch 3 looks good to me.

 

> Setting hostname of docker container breaks with "host" networking mode for 
> Apps which do not run as a YARN service
> ---
>
> Key: YARN-8675
> URL: https://issues.apache.org/jira/browse/YARN-8675
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8675.1.patch, YARN-8675.2.patch, YARN-8675.3.patch
>
>
> Applications like the Spark AM currently do not run as a YARN service and 
> setting hostname breaks driver/executor communication if docker version 
> >=1.13.1 , especially with wire-encryption turned on.
> YARN-8027 sets the hostname if YARN DNS is enabled. But the cluster could 
> have a mix of YARN service/native Applications.
> The proposal is to not set the hostname when "host" networking mode is 
> enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8675) Setting hostname of docker container breaks with "host" networking mode for Apps which do not run as a YARN service

2018-08-24 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592142#comment-16592142
 ] 

Jim Brennan commented on YARN-8675:
---

[~suma.shivaprasad] Thanks for working on this.  I am still not clear on 
whether there is any case in which we should be setting the hostname when 
net=host.  As coded, if the YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME 
environment variable is set, we will use it for the hostname even if net=host.  
Is this comment in DockerLinuxContainerRuntime.java still accurate?

{noformat}
 * YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME} sets the
 * hostname to be used by the Docker container. If not specified, a
 * hostname will be derived from the container ID.  This variable is
 * ignored if the network is 'host' and Registry DNS is not enabled.
{noformat}

> Setting hostname of docker container breaks with "host" networking mode for 
> Apps which do not run as a YARN service
> ---
>
> Key: YARN-8675
> URL: https://issues.apache.org/jira/browse/YARN-8675
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Assignee: Suma Shivaprasad
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8675.1.patch, YARN-8675.2.patch
>
>
> Applications like the Spark AM currently do not run as a YARN service and 
> setting hostname breaks driver/executor communication if docker version 
> >=1.13.1 , especially with wire-encryption turned on.
> YARN-8027 sets the hostname if YARN DNS is enabled. But the cluster could 
> have a mix of YARN service/native Applications.
> The proposal is to not set the hostname when "host" networking mode is 
> enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker

2018-08-28 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8648:
--
Attachment: YARN-8648.001.patch

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-17 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584479#comment-16584479
 ] 

Jim Brennan commented on YARN-8648:
---

I have been experimenting with the following incomplete approach:
 * CGroupsHandler
 ** Add missing controllers to the list of supported controllers
 ** Add initializeAllCGroupControllers()
 *** Initializes all of the cgroups controllers that were not already 
initialized by a ResourceHandler - this is mainly creating the hierarchy 
(hadoop-yarn) cgroup or verifying that it is there and writable.
 ** Add CreateCGroupAllControllers(containerId)
 *** Creates the containerId cgroup under all cgroup controllers
 ** Add DeleteCGroupAllControllers(containerId)
 *** Deletes the containerId cgroup under all cgroup controllers
 *  ResourceHandlerModule
 ** Add wrappers to call the above methods.
 * LinuxContainerExecutor
 ** Add calls to the above methods if the runtime is Docker (it would probably 
be better to move these to the runtime)

So far I have been testing with pre-mounted cgroup hierarchies.  That is, I 
manually created the hadoop-yarn cgroup under each controller.

I've run into several problems experimenting with this approach on RHEL 7:
 * The hadoop-yarn cgroup under the following controllers is being deleted by 
the system (when I let it sit idle for a while): blkio, devices, memory, pids
 ** I got around this for now by just not adding pids to the list and skipping 
the others in the new methods.  We are not leaking cgroups for these 
controllers.
 * I am still leaking cgroups under /sys/fs/cgroup/systemd
 ** Even if I add "systemd" as one of the supported controllers, our mount-tab 
parsing code does not find it because it's not really a controller.
 * This feels pretty hacky - it might be better to just add a new 
dockerCGroupResourceHandler (as I mentioned above) to do effectively the same 
thing - we'd have to supply the list of controllers in a config property and 
deal with systemd.  The way things are right now we would still have to add 
these to the list of supported controllers, because most of the interfaces are 
based on a controller enum.  But even moving it to a separate ResourceHandler 
still seems hacky.
 * I haven't tested the mount-cgroup path yet, but I believe we would need to 
configure all of the controllers that we need to mount in 
container-executor.cfg.

The main advantage of something along these lines is that it preserves the 
existing cgroups hierarchy, and no additional code is needed to deal with 
cgroup parameters.  The other advantage is that we are pre-creating the 
hadoop-yarn cgroups with the correct owner/permissions - docker creates them as 
root.

At this point, I'm not sure if I should proceed with this approach and I'm 
looking for opinions.

The options I am considering are:
 # The approach I've been experimenting with, cleaned up
 # The minimal, just-fix-the-leak approach, which would be to add a 
cleanupCGroups() method to the runtime.
 ** We call it after calling the ResourceHandlers.postComplete() in LCE.
 ** Docker would be the only runtime that implements it.
 ** We'd need to add a container-executor function to handle it.
 ** It could search for the containerId cgroup under all mounted cgroups and 
delete any that it finds
 *** Would not delete any that still have processes
 *** Security concerns?
 # The let-docker-be-docker approach
 ** This is the change-the-cgroup-parent approach.  Instead of passing 
/hadoop-yarn/containerId, we would just use /hadoop-yarn and let docker create 
its dockerContainerId cgroups under there.
 ** Solves the leak by just letting docker handle it - no intermediate 
containerId cgroups are created, so they don't need to be deleted by NM.
 ** To do this, I think we'd need to change every Cgroups ResourceHandler to do 
something different for Docker.  The main ones are for blkio and cpu.
 *** Don't create the containerId cgroups
 *** Don't modify cgroup params directly.
 *** Return the /hadoop-yarn/tasks path for the ADD_PID_TO_CGROUP operation so 
we set the cgroup parent correctly.
 *** Would likely need to add new PrivilegedOps for each cgroup parameter to 
pass them through (these are returned by ResourceHandler.preStart()).
 *** Add code to add each new cgroup parameter to docker run.
 *** Would need to support updating params via docker update command to support 
the ResourceHandler.updateContainer() method.
 *** [~billie.rinaldi], I've thought a bit more about the docker in docker 
case, which we thought would be a problem with this approach.   I think it is 
solvable though - you can obtain the name of the docker cgroup from 
/proc/self/cgroup.  I don't know if this is workable for your use-case though?
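
As a rough illustration of that /proc/self/cgroup lookup (a sketch only, not 
from any patch here; the function name and error handling are mine):

{code}
/* Sketch: find the cgroup path of the current process for one controller,
 * e.g. "cpu".  Lines in /proc/self/cgroup look like
 * "4:cpu,cpuacct:/hadoop-yarn/container_xyz/docker_container_id". */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *get_own_cgroup_path(const char *controller) {
  FILE *fp = fopen("/proc/self/cgroup", "r");
  if (fp == NULL) {
    return NULL;
  }
  char *line = NULL;
  size_t len = 0;
  char *result = NULL;
  while (result == NULL && getline(&line, &len, fp) != -1) {
    char *first = strchr(line, ':');
    char *second = first ? strchr(first + 1, ':') : NULL;
    if (second == NULL) {
      continue;
    }
    *second = '\0';                        /* split controller list from path */
    for (char *tok = strtok(first + 1, ","); tok != NULL;
         tok = strtok(NULL, ",")) {
      if (strcmp(tok, controller) == 0) {
        char *path = second + 1;
        path[strcspn(path, "\n")] = '\0';  /* strip trailing newline */
        result = strdup(path);             /* caller frees */
        break;
      }
    }
  }
  free(line);
  fclose(fp);
  return result;  /* NULL if the controller was not found */
}
{code}

A docker-in-docker nodemanager could use the path this returns to work out 
where its own hierarchy actually lives, which is the idea suggested above.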

Comments?  Concerns?  Alternatives?

cc:[~jlowe], [~ebadger], [~shaneku...@gmail.com], [~billie.rinaldi], [~eyang]

> Container cgroups are leaked when using docker
> 

[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-29 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596345#comment-16596345
 ] 

Jim Brennan commented on YARN-8648:
---

Looks like this is ready for review.

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-31 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599213#comment-16599213
 ] 

Jim Brennan commented on YARN-8648:
---

[~jlowe] thanks for the review!
{quote}Why was the postComplete call moved in reapContainer to before the 
container is removed via docker? Shouldn't docker first remove its cgroups for 
the container before we remove ours?
{quote}
I was trying to preserve the order of operations. Normally postComplete is 
called immediately after the launchContainer() returns, and then 
reapContainer() is called as part of cleanupContainer() processing.
 So the resource handlers usually get a chance to clean up cgroups before we 
clean up the container. If we do the docker cleanup first, it will delete the 
cgroups before the resource handler postComplete processing - it doesn't know 
which ones are handled by resource handlers, so it just deletes them all. Since 
both are really just deleting the cgroups, I don't think the order matters that 
much, so I will move it back if you think that is better.
{quote}Is there a reason to separate removing docker cgroups from removing the 
docker container? This seems like a natural extension to cleaning up after a 
container run by docker, and that's already covered by the reap command. The 
patch would remain a docker-only change but without needing to modify the 
container-executor interface.
{quote}
It is currently being done as part of the reap processing, but as a separate 
privileged operation. We definitely could just add this processing to the 
remove-docker-container processing in container-executor, but it would require 
adding the yarn-hierarchy as an additional argument for the DockerRmCommand. 
This would also require changing the DockerContainerDeletionTask() to store the 
yarn-hierarchy String along with the ContainerId. Despite the additional 
container-executor interface change, I think the current approach is simpler 
and involves less code, but I'm definitely willing to rework it if you think it 
is a better solution.
{quote}Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased. 
Similar for CGROUP_MOUNT.
{quote}
I will fix these.
{quote}The snprintf result should be checked for truncation in addition to 
output errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we 
formulate an incomplete path targeted for deletion if that somehow occurs. 
Alternatively the code could use make_string or asprintf to allocate an 
appropriately sized buffer for each entry rather than trying to reuse a 
manually sized buffer.
{quote}
I will fix this. I forgot about make_string().
{quote}Is there any point in logging to the error file that a path we want to 
delete has already been deleted? This seems like it will just be noise, 
especially if systemd or something else is periodically cleaning some of these 
empty cgroups.
{quote}
I'll remove it - was nice while debugging, but not needed.
{quote}Related to the previous comment, the rmdir result should be checked for 
ENOENT and treat that as success.
{quote}
I explicitly check that the directory exists before calling rmdir, so I'm not 
sure this is necessary, but I can add it anyway.
{quote}Nit: I think lineptr should be freed in the cleanup label in case 
someone later adds a fatal error that jumps to cleanup.
{quote}
Will do.

Thanks again for the review!
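
To make the make_string/rmdir discussion concrete, here is a minimal sketch of 
that cleanup loop (illustration only, not the attached patch; the function name 
and error handling are assumptions):

{code}
/* Sketch: remove the per-container cgroup under every mounted cgroup v1
 * controller.  rmdir() refuses to remove a cgroup that still has tasks or
 * child cgroups, so this cannot affect a live container. */
#define _GNU_SOURCE
#include <errno.h>
#include <mntent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int remove_container_cgroups(const char *yarn_hierarchy,
                                    const char *container_id) {
  int rc = 0;
  FILE *mtab = setmntent("/proc/mounts", "r");
  if (mtab == NULL) {
    return -1;
  }
  struct mntent *ent;
  while ((ent = getmntent(mtab)) != NULL) {
    if (strcmp(ent->mnt_type, "cgroup") != 0) {
      continue;                            /* only cgroup controller mounts */
    }
    char *path = NULL;
    /* asprintf sizes the buffer itself, so the path cannot be silently
     * truncated the way a fixed PATH_MAX snprintf buffer could be. */
    if (asprintf(&path, "%s/%s/%s", ent->mnt_dir, yarn_hierarchy,
                 container_id) < 0) {
      rc = -1;
      break;
    }
    /* ENOENT counts as success: docker, systemd or another cleanup pass may
     * have removed the cgroup between any existence check and the rmdir. */
    if (rmdir(path) != 0 && errno != ENOENT) {
      fprintf(stderr, "Failed to remove cgroup %s: %s\n", path,
              strerror(errno));
      rc = -1;
    }
    free(path);
  }
  endmntent(mtab);
  return rc;
}
{code}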

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> 

[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-31 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599235#comment-16599235
 ] 

Jim Brennan commented on YARN-8648:
---

{quote}
I explicitly check that the directory exists before calling rmdir, so I'm not 
sure this is necessary, but I can add it anyway.
{quote}

There is a small window where it could be removed between the exist check and 
the rmdir, so it is necessary.   I'm tempted to just remove the dir_exists() 
check.



> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-04 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603606#comment-16603606
 ] 

Jim Brennan commented on YARN-8648:
---

Put up another patch to fix the checkstyle issue.

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker

2018-09-04 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8648:
--
Attachment: YARN-8648.003.patch

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker

2018-09-08 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8648:
--
Attachment: YARN-8648.004.patch

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch, YARN-8648.004.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-08 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608212#comment-16608212
 ] 

Jim Brennan commented on YARN-8648:
---

I have uploaded a patch that adds the cgroup cleanup to the DockerRmCommand.

This also includes some fixes for exec_docker_command() in container-executor.c:
 * No longer passes optind, which was shadowing the global variable of the same 
name
 * Fix a stack overrun error - the code was allocating using sizeof(char) 
instead of sizeof(char *)
 * Use optind for indexing into argv instead of assuming the args start at 2.

Added a wrapper function remove_docker_container() that forks and calls 
exec_docker_command() in the child so I could add the cgroup cleanup after it 
finishes.
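
For reference, a simplified sketch of the allocation fix and the fork-and-wait 
pattern described above (not the actual container-executor code; names and 
argument handling are simplified assumptions):

{code}
/* Sketch of the two ideas above: size the exec argv with sizeof(char *)
 * (sizeof(char) under-allocates by a factor of eight on 64-bit and overruns
 * the buffer), and fork so the parent can clean up after docker exits. */
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static char **build_docker_argv(const char *docker_binary,
                                int argc, char **args) {
  char **argv = calloc(argc + 2, sizeof(char *));  /* NOT sizeof(char) */
  if (argv == NULL) {
    return NULL;
  }
  argv[0] = strdup(docker_binary);
  for (int i = 0; i < argc; i++) {
    argv[i + 1] = strdup(args[i]);   /* e.g. &orig_argv[optind] in the caller */
  }
  argv[argc + 1] = NULL;             /* execvp needs a NULL terminator */
  return argv;
}

static int run_docker_and_wait(const char *docker_binary,
                               int argc, char **args) {
  pid_t child = fork();
  if (child == -1) {
    return -1;
  }
  if (child == 0) {
    char **argv = build_docker_argv(docker_binary, argc, args);
    if (argv != NULL) {
      execvp(argv[0], argv);
    }
    _exit(127);                      /* exec failed */
  }
  int status = 0;
  if (waitpid(child, &status, 0) == -1) {
    return -1;
  }
  /* The parent can now remove the leftover cgroups, then report docker's
   * exit status. */
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
{code}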

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch, YARN-8648.004.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-04 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603202#comment-16603202
 ] 

Jim Brennan commented on YARN-8648:
---

I've uploaded a patch that addresses most of the issues raised by [~jlowe], 
except for moving the functionality to the Docker RM command - I wanted to put 
up these other changes before reworking that part.

I misspoke in my earlier comment - I don't think any change would be needed to 
DockerContainerDeletionService, because it ends up calling 
LinuxContainerExecutor.removeDockerContainer(), which can look up the 
yarn-hierarchy.  My only reservation about moving this to the DockerRmCommand 
is that most (if not all) arguments to Docker*Commands are actual command line 
arguments for the docker command.  This would be an exception to that.  I'm not 
sure how much that matters, because I agree this cleanup does naturally align 
with removing the container.

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker

2018-09-04 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8648:
--
Attachment: YARN-8648.002.patch

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-09-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609206#comment-16609206
 ] 

Jim Brennan commented on YARN-8648:
---

This is ready for review.

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch, YARN-8648.002.patch, 
> YARN-8648.003.patch, YARN-8648.004.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup.  All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.  So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

2018-07-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539286#comment-16539286
 ] 

Jim Brennan commented on YARN-8515:
---

Here is an example case that we saw:
Docker ps info for this container:
{noformat}
968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days 
ago container_e07_1528760012992_2875921_01_69
{noformat}
NM log, with some added info from the Docker container and journalctl to show 
where the docker container started and exited:
{noformat}
2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO 
containermanager.ContainerManagerImpl: Start request for 
container_e07_1528760012992_2875921_01_69 by user p_condor
2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO 
application.ApplicationImpl: Adding 
container_e07_1528760012992_2875921_01_69 to application 
application_1528760012992_2875921
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_69 transitioned from NEW to 
LOCALIZING
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO 
yarn.YarnShuffleService: Initializing container 
container_e07_1528760012992_2875921_01_69
2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO 
localizer.ResourceLocalizationService: Created localizer for 
container_e07_1528760012992_2875921_01_69
2018-06-27 16:32:48,786 [LocalizerRunner for 
container_e07_1528760012992_2875921_01_69] INFO 
localizer.ResourceLocalizationService: Writing credentials to the nmPrivate 
file 
/grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_69.tokens.
 Credentials list: 
2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to 
LOCALIZED
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to 
RUNNING
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO 
monitor.ContainersMonitorImpl: Starting resource-monitoring for 
container_e07_1528760012992_2875921_01_69

2018-06-27 16:32:53.345 Docker container started

2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 103072 for container-id 
container_e07_1528760012992_2875921_01_69: 132.5 MB of 3 GB physical memory 
used; 4.3 GB of 6.3 GB virtual memory used

2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG: 
/
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = mapred
STARTUP_MSG: host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.3.2.1806111934

2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl: 
Recovering container_e07_1528760012992_2875921_01_69 in state LAUNCHED with 
exit code -1000
2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding 
container_e07_1528760012992_2875921_01_69 to application 
application_1528760012992_2875921

2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl: 
Waiting for containers: 
2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl: 
Waiting for containers: 
2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl: 
Waiting for containers:

2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_69 transitioned from NEW to 
LOCALIZING
2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO 
yarn.YarnShuffleService: Initializing container 
container_e07_1528760012992_2875921_01_69
2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to 
LOCALIZED
2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO 
nodemanager.ContainerExecutor: Reacquiring 
container_e07_1528760012992_2875921_01_69 with pid 103072
2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to 
RUNNING
2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO 
monitor.ContainersMonitorImpl: Starting resource-monitoring for 
container_e07_1528760012992_2875921_01_69

2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending 
out 598 NM container statuses: 
2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl: 
Registering with RM using containers 
2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 103072 for container-id 

[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

2018-07-10 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8515:
-

 Summary: container-executor can crash with SIGPIPE after 
nodemanager restart
 Key: YARN-8515
 URL: https://issues.apache.org/jira/browse/YARN-8515
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan


When running with docker on large clusters, we have noticed that sometimes 
docker containers are not removed - they remain in the exited state, and the 
corresponding container-executor is no longer running.  Upon investigation, we 
noticed that this always seemed to happen after a nodemanager restart.   The 
sequence leading to the stranded docker containers is:
 # Nodemanager restarts
 # Containers are recovered and then run for a while
 # Containers are killed for some (legitimate) reason
 # Container-executor exits without removing the docker container.

After reproducing this on a test cluster, we found that the container-executor 
was exiting due to a SIGPIPE.

What is happening is that the shell command executor that is used to start 
container-executor has threads reading from c-e's stdout and stderr.  When the 
NM is restarted, these threads are killed.  Then when the container-executor 
continues executing after the container exits with error, it tries to write to 
stderr (ERRORFILE) and gets a SIGPIPE.  Since SIGPIPE is not handled, this 
crashes the container-executor before it can actually remove the docker 
container.

We ran into this in branch 2.8.  The way docker containers are removed has been 
completely redesigned in trunk, so I don't think it will lead to this exact 
failure, but after an NM restart, potentially any write to stderr or stdout in 
the container-executor could cause it to crash.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

2018-07-10 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539290#comment-16539290
 ] 

Jim Brennan commented on YARN-8515:
---

I have been able to repro this reliably on a test cluster.
Repro steps are:
# Start a sleep job with a lot of mappers sleeping for 50 seconds
# On one worker node, kill the NM after a set of containers starts
# Restart the NM
# On the gateway, kill the application (before the current containers finish)

This will leave the containers on the node where the nodemanager was restarted 
in the exited state.

container-executor is not cleaning up the docker containers. Here is an strace 
of one of the container-executors when the application is killed:
{noformat}
-bash-4.2$ sudo strace -s 4096 -f -p 7176
strace: Process 7176 attached
read(3, "143\n", 4096) = 4
close(3) = 0
wait4(7566, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 7566
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7566, si_uid=0, 
si_status=0, si_utime=1, si_stime=0} ---
munmap(0x7f233bfa4000, 4096) = 0
write(2, "Docker container exit code was not zero: 143\n", 45) = -1 EPIPE 
(Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=7176, si_uid=0} ---
+++ killed by SIGPIPE +++
{noformat}

The problem is that when container-executor is started by the NM using the 
privileged operation executor, it attaches stream readers to stdout and stderr. 
When we restart the NM, these threads are killed. Then when the application is 
killed, it kills the running containers and container-executor returns from 
waiting for the docker container. When it tries to write an error message to 
stderr, it generates a SIGPIPE signal, because the other end of the pipe has 
been killed. Since we are not handling that signal, container-executor crashes 
and we never remove the docker container.

I have verified that if I change container-executor to ignore SIGPIPE, the 
problem does not occur.
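
A minimal standalone sketch of that idea (not the actual container-executor patch): with SIGPIPE ignored, a write to a dead pipe comes back as an EPIPE error instead of killing the process, so cleanup can still run.

{code}
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    close(fds[0]);            /* reader side gone, like the NM's dead stderr-reading thread */

    signal(SIGPIPE, SIG_IGN); /* without this, the write below kills the process */

    const char *msg = "Docker container exit code was not zero: 143\n";
    if (write(fds[1], msg, strlen(msg)) < 0) {
        printf("write failed with: %s\n", strerror(errno));  /* EPIPE (Broken pipe) */
    }
    /* ...the docker container could still be removed here... */
    return 0;
}
{code}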

> container-executor can crash with SIGPIPE after nodemanager restart
> ---
>
> Key: YARN-8515
> URL: https://issues.apache.org/jira/browse/YARN-8515
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> When running with docker on large clusters, we have noticed that sometimes 
> docker containers are not removed - they remain in the exited state, and the 
> corresponding container-executor is no longer running.  Upon investigation, 
> we noticed that this always seemed to happen after a nodemanager restart.   
> The sequence leading to the stranded docker containers is:
>  # Nodemanager restarts
>  # Containers are recovered and then run for a while
>  # Containers are killed for some (legitimate) reason
>  # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the 
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor that is used to start 
> container-executor has threads reading from c-e's stdout and stderr.  When 
> the NM is restarted, these threads are killed.  Then when the 
> container-executor continues executing after the container exits with error, 
> it tries to write to stderr (ERRORFILE) and gets a SIGPIPE.  Since SIGPIPE is 
> not handled, this crashes the container-executor before it can actually 
> remove the docker container.
> We ran into this in branch 2.8.  The way docker containers are removed has 
> been completely redesigned in trunk, so I don't think it will lead to this 
> exact failure, but after an NM restart, potentially any write to stderr or 
> stdout in the container-executor could cause it to crash.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-11 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8518:
-

 Summary: test-container-executor test_is_empty() is broken
 Key: YARN-8518
 URL: https://issues.apache.org/jira/browse/YARN-8518
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan


A new test was recently added to test-container-executor.c that has some 
problems.

It is attempting to mkdir() a hard-coded path: /tmp/2938rf2983hcqnw8ud/emptydir

This fails because the base directory is not there.  These directories are not 
being cleaned up either.

It should be using TEST_ROOT.

I don't know what Jira this change was made under - the git commit from July 9 
2018 does not reference a Jira.
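
Something along these lines (illustrative only; TEST_ROOT is the existing scratch-directory macro in test-container-executor.c, and the helper name and fallback value here are made up):

{code}
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#ifndef TEST_ROOT
#define TEST_ROOT "/tmp/test-container-executor"  /* assumed value for this sketch */
#endif

/* Build the empty-dir fixture under TEST_ROOT, creating the parent first. */
static void make_empty_dir(char *path, size_t len) {
    mkdir(TEST_ROOT, 0755);                       /* ok if it already exists */
    snprintf(path, len, "%s/emptydir", TEST_ROOT);
    if (mkdir(path, 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        exit(1);
    }
}

int main(void) {
    char path[4096];
    make_empty_dir(path, sizeof(path));
    printf("created %s\n", path);
    /* cleanup should remove the fixture and TEST_ROOT when the test finishes */
    return 0;
}
{code}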



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-11 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540597#comment-16540597
 ] 

Jim Brennan commented on YARN-8518:
---

[~rkanter], [~szegedim], let me know if you would like me to put up a patch for 
this.

 

> test-container-executor test_is_empty() is broken
> -
>
> Key: YARN-8518
> URL: https://issues.apache.org/jira/browse/YARN-8518
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Priority: Major
>
> A new test was recently added to test-container-executor.c that has some 
> problems.
> It is attempting to mkdir() a hard-coded path: 
> /tmp/2938rf2983hcqnw8ud/emptydir
> This fails because the base directory is not there.  These directories are 
> not being cleaned up either.
> It should be using TEST_ROOT.
> I don't know what Jira this change was made under - the git commit from July 
> 9 2018 does not reference a Jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-12 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542171#comment-16542171
 ] 

Jim Brennan commented on YARN-8518:
---

[~rkanter], can you please review this fix?

 

> test-container-executor test_is_empty() is broken
> -
>
> Key: YARN-8518
> URL: https://issues.apache.org/jira/browse/YARN-8518
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8518.001.patch
>
>
> A new test was recently added to test-container-executor.c that has some 
> problems.
> It is attempting to mkdir() a hard-coded path: 
> /tmp/2938rf2983hcqnw8ud/emptydir
> This fails because the base directory is not there.  These directories are 
> not being cleaned up either.
> It should be using TEST_ROOT.
> I don't know what Jira this change was made under - the git commit from July 
> 9 2018 does not reference a Jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-11 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-8518:
-

Assignee: Jim Brennan

> test-container-executor test_is_empty() is broken
> -
>
> Key: YARN-8518
> URL: https://issues.apache.org/jira/browse/YARN-8518
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> A new test was recently added to test-container-executor.c that has some 
> problems.
> It is attempting to mkdir() a hard-coded path: 
> /tmp/2938rf2983hcqnw8ud/emptydir
> This fails because the base directory is not there.  These directories are 
> not being cleaned up either.
> It should be using TEST_ROOT.
> I don't know what Jira this change was made under - the git commit from July 
> 9 2018 does not reference a Jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

2018-07-12 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8515:
--
Attachment: YARN-8515.001.patch

> container-executor can crash with SIGPIPE after nodemanager restart
> ---
>
> Key: YARN-8515
> URL: https://issues.apache.org/jira/browse/YARN-8515
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8515.001.patch
>
>
> When running with docker on large clusters, we have noticed that sometimes 
> docker containers are not removed - they remain in the exited state, and the 
> corresponding container-executor is no longer running.  Upon investigation, 
> we noticed that this always seemed to happen after a nodemanager restart.   
> The sequence leading to the stranded docker containers is:
>  # Nodemanager restarts
>  # Containers are recovered and then run for a while
>  # Containers are killed for some (legitimate) reason
>  # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the 
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor that is used to start 
> container-executor has threads reading from c-e's stdout and stderr.  When 
> the NM is restarted, these threads are killed.  Then when the 
> container-executor continues executing after the container exits with error, 
> it tries to write to stderr (ERRORFILE) and gets a SIGPIPE.  Since SIGPIPE is 
> not handled, this crashes the container-executor before it can actually 
> remove the docker container.
> We ran into this in branch 2.8.  The way docker containers are removed has 
> been completely redesigned in trunk, so I don't think it will lead to this 
> exact failure, but after an NM restart, potentially any write to stderr or 
> stdout in the container-executor could cause it to crash.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-12 Thread Jim Brennan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8518:
--
Attachment: YARN-8518.001.patch

> test-container-executor test_is_empty() is broken
> -
>
> Key: YARN-8518
> URL: https://issues.apache.org/jira/browse/YARN-8518
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8518.001.patch
>
>
> A new test was recently added to test-container-executor.c that has some 
> problems.
> It is attempting to mkdir() a hard-coded path: 
> /tmp/2938rf2983hcqnw8ud/emptydir
> This fails because the base directory is not there.  These directories are 
> not being cleaned up either.
> It should be using TEST_ROOT.
> I don't know what Jira this change was made under - the git commit from July 
> 9 2018 does not reference a Jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-12 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541894#comment-16541894
 ] 

Jim Brennan commented on YARN-8518:
---

The unit test failure is not related to this change, and it looks like there is 
already a Jira for it: YARN-5857.

I think this is ready for review.

 

> test-container-executor test_is_empty() is broken
> -
>
> Key: YARN-8518
> URL: https://issues.apache.org/jira/browse/YARN-8518
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8518.001.patch
>
>
> A new test was recently added to test-container-executor.c that has some 
> problems.
> It is attempting to mkdir() a hard-coded path: 
> /tmp/2938rf2983hcqnw8ud/emptydir
> This fails because the base directory is not there.  These directories are 
> not being cleaned up either.
> It should be using TEST_ROOT.
> I don't know what Jira this change was made under - the git commit from July 
> 9 2018 does not reference a Jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-12 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541878#comment-16541878
 ] 

Jim Brennan commented on YARN-8518:
---

I can confirm that the pre-commit build does run this test - I just hit 
this failure on YARN-8515.

 

> test-container-executor test_is_empty() is broken
> -
>
> Key: YARN-8518
> URL: https://issues.apache.org/jira/browse/YARN-8518
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8518.001.patch
>
>
> A new test was recently added to test-container-executor.c that has some 
> problems.
> It is attempting to mkdir() a hard-coded path: 
> /tmp/2938rf2983hcqnw8ud/emptydir
> This fails because the base directory is not there.  These directories are 
> not being cleaned up either.
> It should be using TEST_ROOT.
> I don't know what Jira this change was made under - the git commit from July 
> 9 2018 does not reference a Jira.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

2018-07-12 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541868#comment-16541868
 ] 

Jim Brennan commented on YARN-8515:
---

The unit test failure is YARN-8518.  We might want to wait for that one to go 
through before continuing with this one, just to confirm that 
test-container-executor succeeds.

I tested this manually, running several test jobs and restarting the NM while 
jobs were running.  Because trunk has [~shaneku...@gmail.com]'s docker 
life-cycle changes, I don't see the same failure I saw on branch 2.8, but the 
patch does not introduce any new problems that I can see.

 

 

> container-executor can crash with SIGPIPE after nodemanager restart
> ---
>
> Key: YARN-8515
> URL: https://issues.apache.org/jira/browse/YARN-8515
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8515.001.patch
>
>
> When running with docker on large clusters, we have noticed that sometimes 
> docker containers are not removed - they remain in the exited state, and the 
> corresponding container-executor is no longer running.  Upon investigation, 
> we noticed that this always seemed to happen after a nodemanager restart.   
> The sequence leading to the stranded docker containers is:
>  # Nodemanager restarts
>  # Containers are recovered and then run for a while
>  # Containers are killed for some (legitimate) reason
>  # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the 
> container-executor was exiting due to a SIGPIPE.
> What is happening is that the shell command executor that is used to start 
> container-executor has threads reading from c-e's stdout and stderr.  When 
> the NM is restarted, these threads are killed.  Then when the 
> container-executor continues executing after the container exits with error, 
> it tries to write to stderr (ERRORFILE) and gets a SIGPIPE.  Since SIGPIPE is 
> not handled, this crashes the container-executor before it can actually 
> remove the docker container.
> We ran into this in branch 2.8.  The way docker containers are removed has 
> been completely redesigned in trunk, so I don't think it will lead to this 
> exact failure, but after an NM restart, potentially any write to stderr or 
> stdout in the container-executor could cause it to crash.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2018-02-28 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381033#comment-16381033
 ] 

Jim Brennan commented on YARN-7677:
---

Uploaded another patch that fixes the extra import reported by checkstyle.  As 
noted for previous patches, I am not going to fix the "too many arguments" 
checkstyle issues, because adding an argument to writeLaunchEnv and sanitizeEnv 
is appropriate for this change.

The unit test failure for TestContainerSchedulerQueuing is a separate issue: 
[YARN-7700]

 

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 
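
A rough sketch of the intended behavior (the paths and helper name are made up, and the real logic lives in the nodemanager's Java env handling, not in C): a whitelisted variable such as HADOOP_CONF_DIR should only get the NM-side value when the container has not already defined one, using the current process environment as a stand-in for the container launch environment.

{code}
#include <stdio.h>
#include <stdlib.h>

/* Export the NM-side default only if the image/user did not set the variable. */
static void export_whitelisted(const char *name, const char *nm_default) {
    if (getenv(name) == NULL) {
        setenv(name, nm_default, 0);   /* 0 = never overwrite an existing value */
        printf("export %s=%s\n", name, nm_default);
    } else {
        printf("keeping image-provided %s=%s\n", name, getenv(name));
    }
}

int main(void) {
    /* pretend the docker image preset its own conf dir */
    setenv("HADOOP_CONF_DIR", "/opt/image-hadoop/conf", 1);
    export_whitelisted("HADOOP_CONF_DIR", "/etc/hadoop/conf");
    return 0;
}
{code}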



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2018-02-28 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-7677:
--
Attachment: YARN-7677.007.patch

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR

2018-02-28 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381129#comment-16381129
 ] 

Jim Brennan commented on YARN-7677:
---

Checkstyle issues are expected, as noted above.

The unit test failure is tracked by YARN-7700.

[~jlowe], this is ready for review.

> Docker image cannot set HADOOP_CONF_DIR
> ---
>
> Key: YARN-7677
> URL: https://issues.apache.org/jira/browse/YARN-7677
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Eric Badger
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-7677.001.patch, YARN-7677.002.patch, 
> YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, 
> YARN-7677.006.patch, YARN-7677.007.patch
>
>
> Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether 
> it's set by the user or not. It completely bypasses the whitelist and so 
> there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes 
> problems in the Docker use case where Docker containers will set up their own 
> environment and have their own {{HADOOP_CONF_DIR}} preset in the image 
> itself. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-12 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8027:
-

 Summary: Setting hostname of docker container breaks for 
--net=host in docker 1.13
 Key: YARN-8027
 URL: https://issues.apache.org/jira/browse/YARN-8027
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0
Reporter: Jim Brennan
Assignee: Jim Brennan


In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
argument to the docker run command to set the hostname in the container to 
something like:  ctr-e84-1520889172376-0001-01-01.

This does not work when combined with the --net=host command line option in 
Docker 1.13.1.  It causes multiple failures because clients cannot resolve 
that hostname.

We haven't seen this before because we were using docker 1.12.6 which seems to 
ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396115#comment-16396115
 ] 

Jim Brennan commented on YARN-8027:
---

This code was added by [YARN-6804].

[~billie.rinaldi], [~jianh], I don't think we should be setting --hostname when 
--net=host.  Do you agree?


> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-13 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397007#comment-16397007
 ] 

Jim Brennan commented on YARN-8027:
---

{quote}We should look into whether it is a bug in that version of Docker. I see 
a couple of tickets regarding adding support for setting hostname when 
net=host, which would indicate that is a valid setting. I have not dug far 
enough to determine which versions are supposed to support it.
{quote}
[~billie.rinaldi], I think it is actually the opposite. Specifying --hostname 
with --net=host was broken before docker 1.13.1, which is why it didn't cause 
us a problem. In 1.13.1 though, it works, which breaks our ability to resolve 
the hostname, since we are not using Registry DNS.

I agree with [~jlowe] and [~shaneku...@gmail.com] that we should only set the 
hostname when Registry DNS is enabled, as long as this is indeed always the 
case. We haven't experimented with user-defined networks here - is it the case 
that Registry DNS must always be used for user-defined networks?
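
A sketch of just that condition (illustrative only; the real change is in the Java DockerLinuxContainerRuntime, and the function and parameter names below are made up): skip --hostname exactly when the network is host and Registry DNS is not there to resolve the name.

{code}
#include <stdio.h>
#include <string.h>

/* Append --hostname only when something can resolve it afterwards. */
static void build_docker_run(const char *network, int registry_dns_enabled,
                             const char *hostname, char *cmd, size_t len) {
    size_t off = (size_t) snprintf(cmd, len, "docker run --net=%s", network);
    if (off < len && (strcmp(network, "host") != 0 || registry_dns_enabled)) {
        snprintf(cmd + off, len - off, " --hostname=%s", hostname);
    }
}

int main(void) {
    char cmd[256];
    build_docker_run("host", 0, "ctr-e84-1520889172376-0001-01-01", cmd, sizeof(cmd));
    printf("%s\n", cmd);   /* no --hostname: clients would fail to resolve it */
    build_docker_run("bridge", 0, "ctr-e84-1520889172376-0001-01-01", cmd, sizeof(cmd));
    printf("%s\n", cmd);   /* hostname is still set, as before */
    return 0;
}
{code}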

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-13 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8027:
--
Attachment: YARN-8027.001.patch

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators

2018-03-14 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399006#comment-16399006
 ] 

Jim Brennan commented on YARN-8029:
---

There was a related discussion in [HADOOP-11640] about allowing the user to 
specify an alternate delimiter or adding an escaping mechanism.

I think in this case the better solution would be to change the docker runtime 
environment variables to use a different separator, such as a semicolon, a pipe, 
or something else.

> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
> 
>
> Key: YARN-8029
> URL: https://issues.apache.org/jira/browse/YARN-8029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Priority: Major
>
> The following docker-related environment variables specify a comma-separated 
> list of mounts:
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS
> This is a problem because hadoop -Dmapreduce.map.env and related options use  
> comma as a delimiter.   So if I put more than one mount in 
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS the comma in the variable will be 
> treated as a delimiter for the hadoop command line option and all but the 
> first mount will be ignored.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators

2018-03-14 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8029:
-

 Summary: YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use 
commas as separators
 Key: YARN-8029
 URL: https://issues.apache.org/jira/browse/YARN-8029
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0
Reporter: Jim Brennan


The following docker-related environment variables specify a comma-separated 
list of mounts:

YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS
YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS

This is a problem because hadoop -Dmapreduce.map.env and related options use a 
comma as the delimiter.  So if I put more than one mount in 
YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS, the comma in the variable will be treated 
as a delimiter for the hadoop command line option and all but the first mount 
will be ignored.
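
A tiny standalone illustration of the failure mode (the real splitting happens in Hadoop's Java option parsing, not in C, and the mount paths are made up): any split on ',' cuts the mount list after its first element.

{code}
#include <stdio.h>
#include <string.h>

int main(void) {
    /* the value the user intended for a single variable */
    char env[] = "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/tmp/foo:/tmp/foo,/tmp/bar:/tmp/bar";
    /* splitting the option string on commas, as the framework does for key=value pairs */
    for (char *tok = strtok(env, ","); tok != NULL; tok = strtok(NULL, ",")) {
        printf("parsed pair: %s\n", tok);
    }
    /* MOUNTS keeps only "/tmp/foo:/tmp/foo"; "/tmp/bar:/tmp/bar" looks like a
     * separate, malformed pair and is dropped. */
    return 0;
}
{code}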




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-14 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398752#comment-16398752
 ] 

Jim Brennan commented on YARN-8027:
---

My thinking was that the only known case where there is a problem is with 
--net=host, so I was keeping the change narrowed to that case.

With the network set to bridge or none, the default hostname for the container is 
the container id, which is not resolvable inside the container, so changing 
it to a more useful name seems relatively harmless.  For user-defined 
networks, I'm unsure whether there is a case where we would want to set the 
container hostname without using Registry DNS.

I'm happy to simplify this to just check Registry DNS if 
[~shaneku...@gmail.com] and [~billie.rinaldi] agree that is the best solution.

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-14 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398635#comment-16398635
 ] 

Jim Brennan commented on YARN-8027:
---

The unit test failure (testKillOpportunisticForGuaranteedContainer) does not 
appear to be related to my changes.

[~jlowe], [~shaneku...@gmail.com], [~billie.rinaldi], I believe this is ready 
for review.

 

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-14 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399355#comment-16399355
 ] 

Jim Brennan commented on YARN-8027:
---

[~suma.shivaprasad], thanks for your comment.  It sounds like the current patch 
would be ok with you then?  It preserves the current behavior except in the 
case where network is host and Registry DNS is not enabled.

 

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators

2018-03-14 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399240#comment-16399240
 ] 

Jim Brennan commented on YARN-8029:
---

Thanks [~shaneku...@gmail.com]. This does appear to be a duplicate of 
YARN-6830, with respect to the underlying problem, but proposes a different 
solution. Supporting the ability to quote the values does seem like a natural 
approach - it's the first thing I tried to do.

I proposed changing the delimiters in these docker runtime variables because it 
is a safe change - it can't break anything because it's currently not working 
with commas. While a comma seems like a natural choice for the delimiter, I 
don't think changing it to something else would be much of a hardship as long 
as it is documented.

I'm willing to work on either this or YARN-6830, depending on which option is 
favored. cc: [~jlowe], [~templedf], [~aw],

> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
> 
>
> Key: YARN-8029
> URL: https://issues.apache.org/jira/browse/YARN-8029
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Priority: Major
>
> The following docker-related environment variables specify a comma-separated 
> list of mounts:
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS
> This is a problem because hadoop -Dmapreduce.map.env and related options use  
> comma as a delimiter.   So if I put more than one mount in 
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS the comma in the variable will be 
> treated as a delimiter for the hadoop command line option and all but the 
> first mount will be ignored.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8064) Docker ".cmd" files should not be put in hadoop.tmp.dir

2018-04-06 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428841#comment-16428841
 ] 

Jim Brennan commented on YARN-8064:
---

[~ebadger], one question - why are we retaining the old version of 
writeCommandToTempFile(), which is still being used by executeDockerCommand()?  
Might be good to have comments that describe under which conditions each 
version should be used.

 

 

 

> Docker ".cmd" files should not be put in hadoop.tmp.dir
> ---
>
> Key: YARN-8064
> URL: https://issues.apache.org/jira/browse/YARN-8064
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-8064.001.patch, YARN-8064.002.patch, 
> YARN-8064.003.patch, YARN-8064.004.patch, YARN-8064.005.patch
>
>
> Currently all of the docker command files are being put into 
> {{hadoop.tmp.dir}}, which doesn't get cleaned up. So, eventually all of the 
> inodes will fill up and no more tasks will be able to run



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6830) Support quoted strings for environment variables

2018-04-04 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425644#comment-16425644
 ] 

Jim Brennan commented on YARN-6830:
---

The solution proposed by [~aw] for the mapreduce variables is being addressed in 
MAPREDUCE-7069.

> Support quoted strings for environment variables
> 
>
> Key: YARN-6830
> URL: https://issues.apache.org/jira/browse/YARN-6830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-6830.001.patch, YARN-6830.002.patch, 
> YARN-6830.003.patch, YARN-6830.004.patch
>
>
> There are cases where it is necessary to allow for quoted string literals 
> within environment variables values when passed via the yarn command line 
> interface.
> For example, consider the follow environment variables for a MR map task.
> {{MODE=bar}}
> {{IMAGE_NAME=foo}}
> {{MOUNTS=/tmp/foo,/tmp/bar}}
> When running the MR job, these environment variables are supplied as a comma 
> delimited string.
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> In this case, {{MOUNTS}} will be parsed and added to the task environment as 
> {{MOUNTS=/tmp/foo}}. Any attempts to quote the embedded comma separated value 
> results in quote characters becoming part of the value, and parsing still 
> breaks down at the comma.
> This issue is to allow for quoting the comma separated value (escaped double 
> or single quote). This was mentioned on YARN-4595 and will impact YARN-5534 
> as well.
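
A minimal sketch of the quoting idea using the MODE/IMAGE_NAME/MOUNTS example above (illustrative only; the actual parsing would be changed in Hadoop's Java code): commas inside a double-quoted value are kept, and the quotes themselves are stripped.

{code}
#include <stdio.h>

/* Split a comma-separated list of key=value pairs, treating double-quoted
 * segments as literal so an embedded comma survives. */
static void split_env_pairs(const char *s) {
    char buf[256];
    size_t n = 0;
    int quoted = 0;
    for (const char *p = s; ; p++) {
        if (*p == '"') { quoted = !quoted; continue; }   /* toggle state, drop the quote */
        if ((*p == ',' && !quoted) || *p == '\0') {
            buf[n] = '\0';
            if (n > 0) printf("pair: %s\n", buf);
            n = 0;
            if (*p == '\0') break;
            continue;
        }
        if (n + 1 < sizeof(buf)) buf[n++] = *p;
    }
}

int main(void) {
    split_env_pairs("MODE=bar,IMAGE_NAME=foo,MOUNTS=\"/tmp/foo,/tmp/bar\"");
    /* prints: MODE=bar, IMAGE_NAME=foo, and MOUNTS=/tmp/foo,/tmp/bar intact */
    return 0;
}
{code}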



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reopened YARN-8027:
---

Reopening so I can put up a patch for branch 3.

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8027:
--
Attachment: YARN-8027-branch-3.001.patch

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.001.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8027:
--
Attachment: YARN-8027-branch-3.0.001.patch

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, 
> YARN-8027-branch-3.001.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8027:
--
Attachment: (was: YARN-8027-branch-3.001.patch)

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435712#comment-16435712
 ] 

Jim Brennan commented on YARN-8027:
---

Renamed branch-3 patch for branch-3.0.

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures when the client tries to resolve 
> the hostname and it fails.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6434) When setting environment variables, can't use comma for a list of value in key = value pairs.

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436088#comment-16436088
 ] 

Jim Brennan commented on YARN-6434:
---

[~Jaeboo], please close this if you agree that it is resolved by 
[MAPREDUCE-7069].


> When setting environment variables, can't use comma for a list of value in 
> key = value pairs.
> -
>
> Key: YARN-6434
> URL: https://issues.apache.org/jira/browse/YARN-6434
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jaeboo Jeong
>Priority: Major
> Attachments: YARN-6434-trunk.001.patch, YARN-6434.001.patch
>
>
> We can set environment variables using yarn.app.mapreduce.am.env, 
> mapreduce.map.env, mapreduce.reduce.env.
> There is no problem if we use key=value pairs like X=Y, X=$Y.
> However, if we want to set a key to a list of values (e.g. X=Y,Z), we can’t.
> This is related to YARN-4595.
> The attached patch is based on YARN-3768.
> We can set environment variables like below.
> {code}
> mapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS=\"/dir1:/targetdir1,/dir2:/targetdir2\""
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435891#comment-16435891
 ] 

Jim Brennan commented on YARN-8027:
---

I missed fixing the test that randomly picks which network to use.  It will fail 
when the network happens to be 'host'.

 

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures because clients cannot resolve 
> the hostname.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435906#comment-16435906
 ] 

Jim Brennan commented on YARN-8027:
---

Submitted new branch-3.0 patch that fixes the broken TestDockerContainerRuntime 
test.

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, 
> YARN-8027-branch-3.0.002.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures because clients cannot resolve 
> the hostname.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8027:
--
Attachment: YARN-8027-branch-3.0.002.patch

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, 
> YARN-8027-branch-3.0.002.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures because clients cannot resolve 
> the hostname.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6434) When setting environment variables, can't use comma for a list of value in key = value pairs.

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436084#comment-16436084
 ] 

Jim Brennan commented on YARN-6434:
---

This issue was resolved in a different way in MAPREDUCE-7069.  You can now 
specify variables that have commas in them individually, e.g., 
{{mapreduce.map.env.VARNAME=value}}.
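
For illustration (using the example values from YARN-6830/YARN-8071), the per-variable form preserves commas that the older comma-delimited list form would split on:
{code}
# old list form: the comma inside MOUNTS is treated as a separator between variables
-Dmapreduce.map.env="MODE=bar,MOUNTS=/tmp/foo,/tmp/bar"

# per-variable form from MAPREDUCE-7069: the full value, commas included, is kept
-Dmapreduce.map.env.MODE=bar
-Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
{code}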

> When setting environment variables, can't use comma for a list of value in 
> key = value pairs.
> -
>
> Key: YARN-6434
> URL: https://issues.apache.org/jira/browse/YARN-6434
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jaeboo Jeong
>Priority: Major
> Attachments: YARN-6434-trunk.001.patch, YARN-6434.001.patch
>
>
> We can set environment variables using yarn.app.mapreduce.am.env, 
> mapreduce.map.env, mapreduce.reduce.env.
> There is no problem if we use key=value pairs like X=Y, X=$Y.
> However, if we want to set a key to a list of values (e.g. X=Y,Z), we can't.
> This is related to YARN-4595.
> The attached patch is based on YARN-3768.
> We can set environment variables like below.
> {code}
> mapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS=\"/dir1:/targetdir1,/dir2:/targetdir2\""
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436076#comment-16436076
 ] 

Jim Brennan commented on YARN-8027:
---

[~jlowe], this one is ready for review.

 

> Setting hostname of docker container breaks for --net=host in docker 1.13
> -
>
> Key: YARN-8027
> URL: https://issues.apache.org/jira/browse/YARN-8027
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-8027-branch-3.0.001.patch, 
> YARN-8027-branch-3.0.002.patch, YARN-8027.001.patch
>
>
> In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
> argument to the docker run command to set the hostname in the container to 
> something like:  ctr-e84-1520889172376-0001-01-01.
> This does not work when combined with the --net=host command line option in 
> Docker 1.13.1.  It causes multiple failures because clients cannot resolve 
> the hostname.
> We haven't seen this before because we were using docker 1.12.6 which seems 
> to ignore --hostname when you are using --net=host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436207#comment-16436207
 ] 

Jim Brennan commented on YARN-8071:
---

MAPREDUCE-7069 resolved this problem for the following properties:
{quote}mapreduce.map.env.VARNAME=value 
 mapreduce.reduce.env.VARNAME=value 
 yarn.app.mapreduce.am.env.VARNAME=value 
 yarn.app.mapreduce.am.admin.user.env.VARNAME=value
{quote}
The remaining YARN environment variable property is: 
{{yarn.nodemanager.admin-env}}
 I am planning to use this Jira to add support for the 
{{yarn.nodemanager.admin-env.VARNAME=value}} syntax to allow variables with 
commas to be specified for this property.

[~jlowe], [~shaneku...@gmail.com], please let me know if you agree this is 
needed, and also if I'm missing any other properties.
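
As a sketch of the proposed syntax (the variable name and value below are placeholders, not settings from the patch), an admin-env entry whose value contains commas would be expressed individually instead of inside the comma-delimited list:
{code}
# comma-delimited form: a value containing commas cannot be expressed safely
yarn.nodemanager.admin-env=MOUNTS=/tmp/foo,/tmp/bar

# proposed per-variable form: the whole value, commas included, is preserved
yarn.nodemanager.admin-env.MOUNTS=/tmp/foo,/tmp/bar
{code}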

> Provide Spark-like API for setting Environment Variables to enable vars with 
> commas
> ---
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8071:
--
Attachment: YARN-8071.001.patch

> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8071.001.patch
>
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-12 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8071:
--
Summary: Add ability to specify nodemanager environment variables 
individually  (was: Provide Spark-like API for setting Environment Variables to 
enable vars with commas)

> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-12 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436387#comment-16436387
 ] 

Jim Brennan commented on YARN-8071:
---

Changed the description to be more accurate about what this Jira will address.

> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-13 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437537#comment-16437537
 ] 

Jim Brennan commented on YARN-8071:
---

[~jlowe], I believe this patch is ready for review.

 

> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8071.001.patch, YARN-8071.002.patch
>
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}
> The mapreduce properties were dealt with in [MAPREDUCE-7069].  This Jira will 
> address the YARN properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-13 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8071:
--
Description: 
YARN-6830 describes a problem where environment variables that contain commas 
cannot be specified via {{-Dmapreduce.map.env}}.

For example:

{{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}

will set {{MOUNTS}} to {{/tmp/foo}}

In that Jira, [~aw] suggested that we change the API to provide a way to 
specify environment variables individually, the same way that Spark does.
{quote}Rather than fight with a regex why not redefine the API instead?

 

-Dmapreduce.map.env.MODE=bar
 -Dmapreduce.map.env.IMAGE_NAME=foo
 -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar

...

e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar

This greatly simplifies the input validation needed and makes it clear what is 
actually being defined.
{quote}

The mapreduce properties were dealt with in [MAPREDUCE-7069].  This Jira will 
address the YARN properties.

  was:
YARN-6830 describes a problem where environment variables that contain commas 
cannot be specified via {{-Dmapreduce.map.env}}.

For example:

{{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}

will set {{MOUNTS}} to {{/tmp/foo}}

In that Jira, [~aw] suggested that we change the API to provide a way to 
specify environment variables individually, the same way that Spark does.
{quote}Rather than fight with a regex why not redefine the API instead?

 

-Dmapreduce.map.env.MODE=bar
 -Dmapreduce.map.env.IMAGE_NAME=foo
 -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar

...

e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar

This greatly simplifies the input validation needed and makes it clear what is 
actually being defined.
{quote}


> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8071.001.patch
>
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}
> The mapreduce properties were dealt with in [MAPREDUCE-7069].  This Jira will 
> address the YARN properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-13 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-8071:
--
Attachment: YARN-8071.002.patch

> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8071.001.patch, YARN-8071.002.patch
>
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}
> The mapreduce properties were dealt with in [MAPREDUCE-7069].  This Jira will 
> address the YARN properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7667) Docker Stop grace period should be configurable

2018-04-06 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428497#comment-16428497
 ] 

Jim Brennan commented on YARN-7667:
---

Patch looks good to me.

> Docker Stop grace period should be configurable
> ---
>
> Key: YARN-7667
> URL: https://issues.apache.org/jira/browse/YARN-7667
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Attachments: YARN-7667.001.patch, YARN-7667.002.patch, 
> YARN-7667.003.patch
>
>
> {{DockerStopCommand}} has a {{setGracePeriod}} method, but it is never 
> called, so the stop uses Docker's default 10-second grace period.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-04-20 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446405#comment-16446405
 ] 

Jim Brennan commented on YARN-7654:
---

I'm not going to repeat all of the arguments, but I agree with [~jlowe] and 
[~ebadger].

The main point I would like to add is that [~eyang]'s proposal seems to rest on 
the assumption that we will do YARN-8097, exposing the {{--env-file}} option 
to the end-user.  I don't agree that this is necessary or desired.  IIRC, 
YARN-8097 was filed in response to one of [~jlowe]'s earlier reviews of this 
Jira, where he recommended using {{--env-file}} *instead of* a list of {{-e 
key=value}} pairs.

[~jlowe]'s original comment on this (which I still find very compelling):
{quote}Actually now that I think about this more, I think we can side step the 
pipe character hack, the comma problems, the refactoring of the command file, 
etc., if we leverage the --env-file feature of docker-run. Rather than try to 
pass untrusted user data on the docker-run command-line and the real potential 
of accidentally letting some of these "variables" appear as different 
command-line directives to docker, we can dump the variables to a new file next 
to the existing command file that contains the environment variable settings, 
one variable per line. Then we just pass --env-file with the path to the file. 
That way Docker will never misinterpret this data as anything but environment 
variables, we don't have to mess with pipe encoding to try to get these 
variables marshalled through the command file before they get to the 
container-executor, and we don't have to worry about how to properly marshal 
them on the command-line for the docker command. As a bonus, I think that 
precludes needing to refactor the container-executor to do the argument array 
stuff since we're not trying to pass user-specified env variables on the 
command-line. That lets us make this JIRA a lot smaller and more focused, and 
we can move the execv changes to a separate JIRA that wouldn't block this one.
{quote}
I do not see any value in providing two ways to specify environment variables 
to docker, and the {{--env-file}} approach is much cleaner and easier to 
maintain in code.

Perhaps we should consider YARN-8097 on its own.
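
To make the mechanism concrete (the file path below is hypothetical, and the remaining docker arguments are elided), the idea is that the user-supplied variables are written one per line to a file next to the existing command file, and the generated docker invocation references that file:
{code}
# generated environment file, one variable per line
MODE=bar
MOUNTS=/tmp/foo,/tmp/bar

# generated docker invocation referencing the file instead of a list of -e key=value pairs
docker run --env-file /path/to/container.env ...
{code}
This keeps the user-supplied values off the docker run command line entirely, which is the point of the suggestion quoted above.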

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch
>
>
> A Docker image may have an ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we could detect the existence of 
> {{launch_command}} and, based on this variable, launch the docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8097) Add support for Docker env-file switch

2018-04-20 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446414#comment-16446414
 ] 

Jim Brennan commented on YARN-8097:
---

[~eyang], [~jlowe], [~ebadger], [~shaneku...@gmail.com], my understanding is 
that this Jira was filed in response to a comment from [~jlowe] in YARN-7654 
where he recommended using {{--env-file}} instead of {{-e key=value}} pairs. I 
don't think it was [~jlowe]'s intent to expose this capability to the end-user 
as another way of providing environment variables.

> Add support for Docker env-file switch
> --
>
> Key: YARN-8097
> URL: https://issues.apache.org/jira/browse/YARN-8097
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.2.0
>Reporter: Eric Yang
>Priority: Major
> Attachments: YARN-8097.001.patch
>
>
> There are two different ways to pass user environment variables to docker: 
> the -e flag and --env-file, which references a file containing 
> environment variable key/value pairs.  It would be nice to have a way to 
> specify an env-file from HDFS, localize the .env file into the container's localized 
> directory, and pass the --env-file flag to the docker run command.  This approach 
> would prevent ENV-based passwords from showing up in log files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8071) Add ability to specify nodemanager environment variables individually

2018-04-16 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439791#comment-16439791
 ] 

Jim Brennan commented on YARN-8071:
---

[~jlowe] thanks for the review.
{quote}The original code passed the current environment map, allowing admin 
variables to reference other variables defined so far in the environment. The 
new code passes an empty map which would seem to preclude this and could be a 
backwards compatibility issue.
{quote}
Thanks for pointing this out. I meant to ask specifically about this change. 
This was intentional. I agree it is a change in functionality, but it seemed to 
me that the current behavior may actually be a bug, not the intended behavior. 
I based this on the documentation for {{yarn.nodemanager.admin-env}}, the 
comment that precedes this code ({{variables here will be forced in, even if 
the container has specified them}}), and the fact that everything else in this 
function overrides any user-specified variable (with the exception of the 
Windows-specific classpath stuff).

That said, I don't have a good idea of how likely this change would be to break 
something, so I am definitely willing to change it if it is considered too 
dangerous.
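
For concreteness (the variable names here are only illustrative, not taken from the patch), the compatibility concern is about admin-env values that expand a variable already defined earlier in the container environment, e.g.:
{code}
# the $PATH reference resolves against the environment built up so far for the container
yarn.nodemanager.admin-env=EXTRA_TOOLS=/opt/tools/bin:$PATH
{code}
With an empty map passed instead of the current environment, such a reference would no longer see the earlier definitions.
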
{quote}The changes to TestContainerLaunch#testPrependDistcache appear to be 
unnecessary?
{quote}
They were intentional. When I was testing my new test case, I realized that 
passing the empty set for the {{nmVars}} argument leads to exceptions in 
{{addToEnvMap()}}, so I fixed the testPrependDistcache() cases as well. I 
assume this Windows-only test must be failing without this fix.

> Add ability to specify nodemanager environment variables individually
> -
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8071.001.patch, YARN-8071.002.patch
>
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}
> The mapreduce properties were dealt with in [MAPREDUCE-7069].  This Jira will 
> address the YARN properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas

2018-03-27 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415782#comment-16415782
 ] 

Jim Brennan commented on YARN-8071:
---

[~jlowe], yes I think this will affect mapreduce, yarn, and common code.   I 
haven't done the analysis yet to figure out everything this will affect.  
Should this be refiled in Hadoop Common, or should we add additional 
components to this Jira?

> Provide Spark-like API for setting Environment Variables to enable vars with 
> commas
> ---
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators

2018-03-29 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419392#comment-16419392
 ] 

Jim Brennan commented on YARN-8029:
---

Based on discussions in [YARN-6830], the preference is to provide a solution 
that allows the use of commas for these variables.  So we are not going to do 
this.


> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
> 
>
> Key: YARN-8029
> URL: https://issues.apache.org/jira/browse/YARN-8029
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-8029.001.patch, YARN-8029.002.patch
>
>
> The following docker-related environment variables specify a comma-separated 
> list of mounts:
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS
> This is a problem because hadoop -Dmapreduce.map.env and related options use 
> a comma as a delimiter.  So if I put more than one mount in 
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS, the comma in the variable will be 
> treated as a delimiter for the hadoop command-line option and all but the 
> first mount will be ignored.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas

2018-03-28 Thread Jim Brennan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417505#comment-16417505
 ] 

Jim Brennan commented on YARN-8071:
---

[~jlowe], I've filed [MAPREDUCE-7069] to address the mapreduce properties.  I 
will use this one to address the yarn properties.


> Provide Spark-like API for setting Environment Variables to enable vars with 
> commas
> ---
>
> Key: YARN-8071
> URL: https://issues.apache.org/jira/browse/YARN-8071
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> YARN-6830 describes a problem where environment variables that contain commas 
> cannot be specified via {{-Dmapreduce.map.env}}.
> For example:
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> will set {{MOUNTS}} to {{/tmp/foo}}
> In that Jira, [~aw] suggested that we change the API to provide a way to 
> specify environment variables individually, the same way that Spark does.
> {quote}Rather than fight with a regex why not redefine the API instead?
>  
> -Dmapreduce.map.env.MODE=bar
>  -Dmapreduce.map.env.IMAGE_NAME=foo
>  -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
> ...
> e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> This greatly simplifies the input validation needed and makes it clear what 
> is actually being defined.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6830) Support quoted strings for environment variables

2018-03-20 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated YARN-6830:
--
Attachment: YARN-6830.002.patch

> Support quoted strings for environment variables
> 
>
> Key: YARN-6830
> URL: https://issues.apache.org/jira/browse/YARN-6830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Assignee: Shane Kumpf
>Priority: Major
> Attachments: YARN-6830.001.patch, YARN-6830.002.patch
>
>
> There are cases where it is necessary to allow for quoted string literals 
> within environment variable values when passed via the yarn command line 
> interface.
> For example, consider the following environment variables for a MR map task.
> {{MODE=bar}}
> {{IMAGE_NAME=foo}}
> {{MOUNTS=/tmp/foo,/tmp/bar}}
> When running the MR job, these environment variables are supplied as a comma 
> delimited string.
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> In this case, {{MOUNTS}} will be parsed and added to the task environment as 
> {{MOUNTS=/tmp/foo}}. Any attempt to quote the embedded comma-separated value 
> results in quote characters becoming part of the value, and parsing still 
> breaks down at the comma.
> This issue is to allow for quoting the comma separated value (escaped double 
> or single quote). This was mentioned on YARN-4595 and will impact YARN-5534 
> as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6830) Support quoted strings for environment variables

2018-03-20 Thread Jim Brennan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan reassigned YARN-6830:
-

Assignee: Jim Brennan  (was: Shane Kumpf)

> Support quoted strings for environment variables
> 
>
> Key: YARN-6830
> URL: https://issues.apache.org/jira/browse/YARN-6830
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-6830.001.patch, YARN-6830.002.patch, 
> YARN-6830.003.patch
>
>
> There are cases where it is necessary to allow for quoted string literals 
> within environment variable values when passed via the yarn command line 
> interface.
> For example, consider the following environment variables for a MR map task.
> {{MODE=bar}}
> {{IMAGE_NAME=foo}}
> {{MOUNTS=/tmp/foo,/tmp/bar}}
> When running the MR job, these environment variables are supplied as a comma 
> delimited string.
> {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}
> In this case, {{MOUNTS}} will be parsed and added to the task environment as 
> {{MOUNTS=/tmp/foo}}. Any attempt to quote the embedded comma-separated value 
> results in quote characters becoming part of the value, and parsing still 
> breaks down at the comma.
> This issue is to allow for quoting the comma separated value (escaped double 
> or single quote). This was mentioned on YARN-4595 and will impact YARN-5534 
> as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


