[jira] [Commented] (YARN-7678) Ability to enable logging of container memory stats
[ https://issues.apache.org/jira/browse/YARN-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310302#comment-16310302 ] Jim Brennan commented on YARN-7678: --- Will do. Thanks! > Ability to enable logging of container memory stats > --- > > Key: YARN-7678 > URL: https://issues.apache.org/jira/browse/YARN-7678 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.8.0, 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan > Attachments: YARN-7678.001.patch > > > YARN-3424 changed logging of memory stats in ContainersMonitorImpl from INFO > to DEBUG. > We have found these log messages to be useful information in Out-of-Memory > situations - they provide detail that helps show the memory profile of the > container over time, which can be helpful in determining root cause. > Here's an example message from YARN-3424: > {noformat} > 2015-03-27 09:32:48,905 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Memory usage of ProcessTree 9215 for container-id > container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory > used; 2.6 GB of 2.1 GB virtual memory used > {noformat} > Propose to change this to use a separate logger for this message, so that we > can enable debug logging for this without enabling all of the other debug > logging for ContainersMonitorImpl. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
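The separate-logger idea above can be sketched as a child logger whose level is controlled independently of the class's main logger. This is a minimal illustration using java.util.logging so it is self-contained (Hadoop itself uses commons-logging/log4j), and the ".audit" logger name is an assumption borrowed from the YARN-8444 log excerpt below, which shows "ContainersMonitorImpl.audit" as the logger name:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SeparateAuditLogger {
    // Main logger for the monitor class; its level governs general debug output.
    static final Logger MONITOR = Logger.getLogger("ContainersMonitorImpl");
    // Dedicated child logger for the per-container memory-stats message.
    // Because it has its own name, its level can be raised in the logging
    // configuration without enabling debug for the rest of the class.
    static final Logger AUDIT = Logger.getLogger("ContainersMonitorImpl.audit");

    public static void main(String[] args) {
        MONITOR.setLevel(Level.INFO); // general debug logging stays off
        AUDIT.setLevel(Level.FINE);   // memory-stats logging enabled alone
        System.out.println(MONITOR.isLoggable(Level.FINE)); // false
        System.out.println(AUDIT.isLoggable(Level.FINE));   // true
        if (AUDIT.isLoggable(Level.FINE)) {
            AUDIT.fine("Memory usage of ProcessTree ...");
        }
    }
}
```

With log4j the same effect comes from configuring the two logger names at different levels in the logging properties, which is the point of the proposal: memory-stats messages become opt-in without flooding the NM log with the rest of ContainersMonitorImpl's debug output.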
[jira] [Updated] (YARN-7678) Ability to enable logging of container memory stats
[ https://issues.apache.org/jira/browse/YARN-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-7678: -- Attachment: YARN-7678-branch-2.001.patch Providing a patch for branch-2. I repeated the manual tests described above for this patch.
[jira] [Commented] (YARN-7678) Ability to enable logging of container memory stats
[ https://issues.apache.org/jira/browse/YARN-7678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16311491#comment-16311491 ] Jim Brennan commented on YARN-7678: --- The unit test failure in TestContainerSchedulerQueuing is unrelated to this change. I reran that test locally and it still succeeds for me. I described the testing I've done in an earlier comment. I think the branch-2 patch is ready for review.
[jira] [Commented] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value
[ https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518171#comment-16518171 ] Jim Brennan commented on YARN-8444: --- The bad value came from /proc/meminfo - it looks like it returned a negative value expressed as an unsigned decimal value, which was too big to parse as a long. > NodeResourceMonitor crashes on bad swapFree value > - > > Key: YARN-8444 > URL: https://issues.apache.org/jira/browse/YARN-8444 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.3, 3.0.2 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > Saw this on a node that was having difficulty preempting containers. Can't > have NodeResourceMonitor exiting. System was above 99% memory used at the > time, so it may only be something that happens when normal preemption isn't > working right, but we should fix since this is a critical monitor to the health > of the node. > > {noformat} > 2018-06-04 14:28:08,539 [Container Monitor] DEBUG > ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for > container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB > physical memory used; 5.0 GB of 7.3 GB virtual memory used > 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR > yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource > Monitor,5,main] threw an Exception. 
> java.lang.NumberFormatException: For input string: "18446744073709551596" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:592) > at java.lang.Long.parseLong(Long.java:631) > at > org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257) > at > org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591) > at > org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601) > at > org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74) > at > org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193) > 2018-06-04 14:28:30,747 > [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO > util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of > approximately 9330ms > {noformat}
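The failing input 18446744073709551596 is 2^64 - 20, i.e. a kernel counter that went negative (swapFree underflow) and was printed by /proc/meminfo as an unsigned 64-bit decimal, which exceeds Long.MAX_VALUE and so throws in Long.parseLong. One defensive way to read such a field is to fall back to Long.parseUnsignedLong and clamp wrapped (negative) values to zero; the method name below is illustrative, not necessarily Hadoop's actual fix:

```java
public class SwapFreeParse {
    // Defensive parse for a /proc/meminfo counter: the kernel can report a
    // huge unsigned value when the counter underflows, and Long.parseLong
    // throws on anything above Long.MAX_VALUE.
    static long parseMemInfoValue(String s) {
        try {
            return Long.parseLong(s);
        } catch (NumberFormatException e) {
            // Reinterpret as an unsigned 64-bit value (Java 8+). A negative
            // result means the counter wrapped, so clamp to 0 rather than
            // let the monitoring thread die.
            long asUnsigned = Long.parseUnsignedLong(s);
            return asUnsigned < 0 ? 0L : asUnsigned;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMemInfoValue("1024"));                 // 1024
        System.out.println(parseMemInfoValue("18446744073709551596")); // 0
    }
}
```

Clamping to zero treats "no swap free" as the sane interpretation of an underflowed counter, which keeps NodeResourceMonitor alive on an already memory-starved node instead of killing the thread with an uncaught exception.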
[jira] [Updated] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value
[ https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8444: -- Description: Saw this on a node that was running out of memory. Can't have NodeResourceMonitor exiting. System was above 99% memory used at the time, so this is not a common occurrence, but we should fix since this is a critical monitor to the health of the node. {noformat} 2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory used; 5.0 GB of 7.3 GB virtual memory used 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] threw an Exception. java.lang.NumberFormatException: For input string: "18446744073709551596" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:592) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257) at org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591) at org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74) at org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193) 2018-06-04 14:28:30,747 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9330ms {noformat} was: Saw this on a node that was having difficulty preempting containers. Can't have NodeResourceMonitor exiting. System was above 99% memory used at the time, so it may only be something that happens when normal preemption isn't working right, but we should fix since this is a critical monitor to the health of the node.
[jira] [Created] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value
Jim Brennan created YARN-8444: - Summary: NodeResourceMonitor crashes on bad swapFree value Key: YARN-8444 URL: https://issues.apache.org/jira/browse/YARN-8444 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.2, 2.8.3 Reporter: Jim Brennan Assignee: Jim Brennan Saw this on a node that was having difficulty preempting containers. Can't have NodeResourceMonitor exiting. System was above 99% memory used at the time, so it may only be something that happens when normal preemption isn't working right, but we should fix since this is a critical monitor to the health of the node. {noformat} 2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory used; 5.0 GB of 7.3 GB virtual memory used 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] threw an Exception. 
java.lang.NumberFormatException: For input string: "18446744073709551596" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:592) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257) at org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591) at org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74) at org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193) 2018-06-04 14:28:30,747 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9330ms {noformat}
[jira] [Commented] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value
[ https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519549#comment-16519549 ] Jim Brennan commented on YARN-8444: --- [~eepayne], can you please review? > NodeResourceMonitor crashes on bad swapFree value > - > > Key: YARN-8444 > URL: https://issues.apache.org/jira/browse/YARN-8444 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.3, 3.0.2 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8444.001.patch
[jira] [Updated] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8640: -- Attachment: YARN-8640.001.patch > Restore previous state in container-executor if write_exit_code_file_as_nm > fails > > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} has a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state.
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576916#comment-16576916 ] Jim Brennan commented on YARN-8648: --- One proposal to fix the leaking cgroups is to have docker put its containers directly under the {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}} directory. For example, instead of using {{cgroup-parent=/hadoop-yarn/container_id}}, we use {{cgroup-parent=/hadoop-yarn}}. This does cause docker to create a {{hadoop-yarn}} cgroup under each resource type, and it does not clean those up, but that is just one unused cgroup per resource type vs hundreds of thousands. This can be done by just passing an empty string to DockerLinuxContainerRuntime.addCGroupParentIfRequired(), or otherwise changing it to ignore the containerIdStr. Doing this and removing the code that cherry-picks the PID in container-executor does work, but the NM still creates the per-container cgroups as well - they're just not used. The other issue with this approach is that cpu.shares is still updated (to reflect the requested vcores allotment) in the per-container cgroup, so it is effectively ignored. In our code, we addressed this by passing the cpu.shares value in the docker run --cpu-shares command line argument. I'm still thinking about the best way to address this. Currently most of the resourceHandler processing happens at the LinuxContainerExecutor level. But there is clearly a difference in how cgroups need to be handled for docker vs linux cases. In the docker case, we should arguably use docker command line arguments instead of directly setting up cgroups. One option would be to provide a runtime interface useResourceHandlers() which for Docker would return false. We could then disable all of the resource-handler processing that happens in the container executor, and add the necessary interfaces to handle cgroup parameters to the docker runtime. 
Another option would be to move the resource handler processing down into the runtime. This is a bigger change, but may be cleaner. The docker runtime may still just ignore those handlers, but that detail would be hidden at the container executor level. cc: [~ebadger] [~jlowe] [~eyang] [~shaneku...@gmail.com] [~billie.rinaldi] > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is > good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd. So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. 
> I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
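The proposal above boils down to two docker arguments: the bare hierarchy as {{--cgroup-parent}} (no per-container subdirectory for docker to leak under every controller) and the vcore weight as {{--cpu-shares}} (so nothing needs to be written into a per-container cgroup). A small argument-builder sketch is below; the method name and the 1024-shares-per-vcore factor are illustrative assumptions, not Hadoop's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class DockerCgroupArgs {
    // Sketch of the proposed change: always pass the bare hierarchy as
    // --cgroup-parent (no per-container subdirectory), and carry the cpu
    // weight via --cpu-shares instead of writing cpu.shares into a
    // per-container cgroup that docker's processes never join.
    static List<String> cgroupArgs(String hierarchy, int vcores) {
        List<String> args = new ArrayList<>();
        args.add("--cgroup-parent=" + hierarchy);    // e.g. /hadoop-yarn
        args.add("--cpu-shares=" + (vcores * 1024)); // assumed weight per vcore
        return args;
    }

    public static void main(String[] args) {
        // Prints: [--cgroup-parent=/hadoop-yarn, --cpu-shares=2048]
        System.out.println(cgroupArgs("/hadoop-yarn", 2));
    }
}
```

The trade-off noted in the comment applies: docker still creates one unused {{hadoop-yarn}} cgroup per controller, but that is a constant cost rather than one leaked cgroup per container per controller.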
[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8648: -- Labels: Docker (was: )
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576855#comment-16576855 ] Jim Brennan commented on YARN-8648: --- Another problem we have seen is that container-executor still has code that cherry-picks the PID of the launch shell from the docker container and writes that into the {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/tasks}} file, effectively moving it from {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}} to {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id}}. So you end up with one process out of the container in the {{container_id}} cgroup, and the rest in the {{container_id/docker_container_id}} cgroup.
[jira] [Commented] (YARN-6495) check docker container's exit code when writing to cgroup task files
[ https://issues.apache.org/jira/browse/YARN-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576925#comment-16576925 ] Jim Brennan commented on YARN-6495: --- As part of YARN-8648, I am proposing that we can just remove the code that this patch is fixing. If we are using cgroups, we are passing the {{cgroup-parent}} argument to docker, which accomplishes what this code was trying to do in a much more deterministic and reliable way. My proposal would be to remove this code as part of YARN-8648, but if there is a preference for doing that in a separate Jira, I can file a new one. Assuming there is agreement, I think we can close out this Jira. [~Jaeboo], [~ebadger], do you agree? > check docker container's exit code when writing to cgroup task files > > > Key: YARN-6495 > URL: https://issues.apache.org/jira/browse/YARN-6495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jaeboo Jeong >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-6495.001.patch, YARN-6495.002.patch > > > If I execute a simple command like date in a docker container, the application > fails to complete successfully. > For example: > {code} > $ yarn jar > $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar > -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker -shell_command "date" -jar > $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar > -num_containers 1 -timeout 360 > … > 17/04/12 00:16:40 INFO distributedshell.Client: Application did finished > unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring > loop > 17/04/12 00:16:40 ERROR distributedshell.Client: Application failed to > complete successfully > {code} > The error log is like below: > {code} > ... > Failed to write pid to file > /cgroup_parent/cpu/hadoop-yarn/container_/tasks - No such process > ... > {code} > When writing the pid to the cgroup tasks file, container-executor doesn’t check the docker > container’s status. > If the container finished very quickly, we can’t write the pid to the cgroup tasks file, > and that is not a problem. > So container-executor needs to check the docker container’s exit code while writing the pid > to the cgroup tasks file.
[jira] [Created] (YARN-8648) Container cgroups are leaked when using docker
Jim Brennan created YARN-8648: - Summary: Container cgroups are leaked when using docker Key: YARN-8648 URL: https://issues.apache.org/jira/browse/YARN-8648 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan When you run with docker and enable cgroups for cpu, docker creates cgroups for all resources on the system, not just for cpu. For instance, if {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, the nodemanager will create a cgroup for each container under {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path via the {{--cgroup-parent}} command line argument. Docker then creates a cgroup for the docker container under that, for instance: {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. When the container exits, docker cleans up the {{docker_container_id}} cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}. The problem is that docker also creates that same hierarchy under every resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, perf_event, and systemd. So for instance, docker creates {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up the {{container_id}} cgroups for these other resources. On one of our busy clusters, we found > 100,000 of these leaked cgroups. I found this in our 2.8-based version of hadoop, but I have been able to repro with current hadoop.
[jira] [Created] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails
Jim Brennan created YARN-8640: - Summary: Restore previous state in container-executor if write_exit_code_file_as_nm fails Key: YARN-8640 URL: https://issues.apache.org/jira/browse/YARN-8640 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan The container-executor function {{write_exit_code_file_as_nm}} has a number of failure conditions where it simply returns -1 without restoring the previous state. This is not a problem in any of the places where it is currently called, but it could be a problem if future code changes call it before code that depends on the previous state.
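The fix pattern here is generic: capture whatever state the function mutates up front, and restore it on every exit path, including the failure branches. A language-neutral sketch of the idea (in Python rather than the container-executor's C, with a hypothetical state object standing in for the real process state such as the effective uid):

```python
class NodeState:
    """Stand-in for process state a privileged helper might mutate."""
    def __init__(self):
        self.user = "nm"

def write_exit_code_sketch(state, write):
    """Mutate state, do the risky work, and restore state on every path.

    `write` is a callable that may raise OSError; the bug pattern being
    fixed is returning -1 on a failure branch without restoring `state`.
    """
    previous = state.user
    state.user = "root"          # become privileged for the write
    try:
        write()
        return 0
    except OSError:
        return -1                # failure branch still restores below
    finally:
        state.user = previous    # restore previous state unconditionally
```

In C there is no `finally`, so the equivalent is a single cleanup label that every failure branch jumps to before returning, which is presumably what the patch adds.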
[jira] [Comment Edited] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576855#comment-16576855 ] Jim Brennan edited comment on YARN-8648 at 8/10/18 9:37 PM: Another problem we have seen is that container-executor still has code that cherry-picks the PID of the launch shell from the docker container and writes that into the {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/tasks}} file, effectively moving it from {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}} to {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id}}. So you end up with one process out of the container in the {{container_id}} cgroup, and the rest in the {{container_id/docker_container_id}} cgroup. Since we are passing the {{--cgroup-parent}} to docker, there is no need to manually write the pid - we can just remove the code that does this. was (Author: jim_brennan): Another problem we have seen is that container-executor still has code that cherry-picks the PID of the launch shell from the docker container and writes that into the {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/tasks}} file, effectively moving it from {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}} to {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id}}. So you end up with one process out of the container in the {{container_id}} cgroup, and the rest in the {{container_id/docker_container_id}} cgroup. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. 
For instance, if > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd. So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
[jira] [Created] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
Jim Brennan created YARN-8656: - Summary: container-executor should not write cgroup tasks files for docker containers Key: YARN-8656 URL: https://issues.apache.org/jira/browse/YARN-8656 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker run}} to ensure that all processes for the container are placed into a cgroup under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. Docker creates a cgroup there with the docker container id as the name and all of the processes in the container go into that cgroup. container-executor has code in {{launch_docker_container_as_user()}} that then cherry-picks the PID of the docker container (usually the launch shell) and writes that into the {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively moving it from {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with one process out of the container in the {{container_id}} cgroup, and the rest in the {{container_id/docker_container_id}} cgroup. Since we are passing the {{--cgroup-parent}} to docker, there is no need to manually write the container pid to the tasks file - we can just remove the code that does this in the docker case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
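The split described above is easy to see with a toy model of the two tasks files. A sketch using plain text files in place of real cgroup tasks files (writing a pid to a real tasks file is what migrates a process between cgroups; plain files only illustrate the bookkeeping, and the helper name is made up):

```python
def move_pid(pid, src_tasks, dst_tasks):
    """Mimic what writing a pid into a cgroup tasks file does: the kernel
    migrates that pid out of its current cgroup into the target one."""
    with open(src_tasks) as f:
        pids = set(f.read().split())
    pids.discard(str(pid))
    with open(src_tasks, "w") as f:
        f.write("\n".join(sorted(pids)))
    with open(dst_tasks, "a") as f:
        f.write("%d\n" % pid)
```

With the launch shell's pid at 100 and the rest of the container at 101 and 102, cherry-picking 100 into the {{container_id}} tasks file leaves one process in the parent cgroup and the rest in the {{docker_container_id}} leaf, which is exactly the inconsistent accounting the issue describes. Dropping the write (since {{--cgroup-parent}} already places everything correctly) avoids the split.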
[jira] [Commented] (YARN-6495) check docker container's exit code when writing to cgroup task files
[ https://issues.apache.org/jira/browse/YARN-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578844#comment-16578844 ] Jim Brennan commented on YARN-6495: --- {quote}My proposal would be to remove this code as part of YARN-8648, but if there is a preference for doing that in a separate Jira, I can file a new one. Assuming there is agreement, I think we can close out this Jira. {quote} I decided to file a new Jira: YARN-8656 for this, rather than lumping it in with YARN-8648. > check docker container's exit code when writing to cgroup task files > > > Key: YARN-6495 > URL: https://issues.apache.org/jira/browse/YARN-6495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jaeboo Jeong >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-6495.001.patch, YARN-6495.002.patch > > > If I execute simple command like date on docker container, the application > failed to complete successfully. > for example, > {code} > $ yarn jar > $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar > -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker -shell_command "date" -jar > $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar > -num_containers 1 -timeout 360 > … > 17/04/12 00:16:40 INFO distributedshell.Client: Application did finished > unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring > loop > 17/04/12 00:16:40 ERROR distributedshell.Client: Application failed to > complete successfully > {code} > The error log is like below. > {code} > ... > Failed to write pid to file > /cgroup_parent/cpu/hadoop-yarn/container_/tasks - No such process > ... > {code} > When writing pid to cgroup tasks, container-executor doesn’t check docker > container’s status. > If the container finished very quickly, we can’t write pid to cgroup tasks, > and it is not problem. 
> So container-executor needs to check the docker container’s exit code when > writing the pid to the cgroup tasks file fails.
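The check YARN-6495 asks for amounts to: if writing the pid fails, first see whether the container already exited before treating the failure as fatal. A sketch with hypothetical callables standing in for the real write and the docker status probe (the actual container-executor code is C and was later removed by YARN-8656):

```python
def write_pid_checked(write_pid, get_exit_code):
    """Write the pid to the tasks file; tolerate failure if the container
    already finished.

    write_pid: callable that raises OSError on failure (e.g. ESRCH,
               "No such process", for a short-lived container).
    get_exit_code: returns the container's exit code, or None if running.
    """
    try:
        write_pid()
        return 0
    except OSError:
        # Short-lived container: the process tree is already gone, so the
        # failed write is expected and should not fail the launch.
        if get_exit_code() is not None:
            return 0
        return -1
```

The "date" example above is the short-lived case: the container exits before the pid can be written, so the launch should still be reported as successful.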
[jira] [Assigned] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
[ https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan reassigned YARN-8656: - Assignee: Jim Brennan > container-executor should not write cgroup tasks files for docker containers > > > Key: YARN-8656 > URL: https://issues.apache.org/jira/browse/YARN-8656 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker > run}} to ensure that all processes for the container are placed into a cgroup > under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. > Docker creates a cgroup there with the docker container id as the name and > all of the processes in the container go into that cgroup. > container-executor has code in {{launch_docker_container_as_user()}} that > then cherry-picks the PID of the docker container (usually the launch shell) > and writes that into the > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively > moving it from > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with > one process out of the container in the {{container_id}} cgroup, and the rest > in the {{container_id/docker_container_id}} cgroup. > Since we are passing the {{--cgroup-parent}} to docker, there is no need to > manually write the container pid to the tasks file - we can just remove the > code that does this in the docker case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
[ https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581110#comment-16581110 ] Jim Brennan commented on YARN-8656: --- I am unable to repro the unit test failure in TestContainerManager#testLocalingResourceWhileContainerRunning. I don't think it is related to my change. > container-executor should not write cgroup tasks files for docker containers > > > Key: YARN-8656 > URL: https://issues.apache.org/jira/browse/YARN-8656 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8656.001.patch, YARN-8656.002.patch > > > If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker > run}} to ensure that all processes for the container are placed into a cgroup > under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. > Docker creates a cgroup there with the docker container id as the name and > all of the processes in the container go into that cgroup. > container-executor has code in {{launch_docker_container_as_user()}} that > then cherry-picks the PID of the docker container (usually the launch shell) > and writes that into the > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively > moving it from > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with > one process out of the container in the {{container_id}} cgroup, and the rest > in the {{container_id/docker_container_id}} cgroup. > Since we are passing the {{--cgroup-parent}} to docker, there is no need to > manually write the container pid to the tasks file - we can just remove the > code that does this in the docker case. 
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580067#comment-16580067 ] Jim Brennan commented on YARN-8648: --- [~jlowe] thanks for the comment. {quote}We should consider breaking this up into two JIRAs if it proves difficult to hash through the design. It's a relatively small change to move the docker containers under the top-level YARN cgroup hierarchy to fix the cgroup leaks, with the side-effect that the NM continues to create and clean up unused cgroups per docker container launched. We could follow up that change with another JIRA to resolve the new design for the cgroup / container runtime interaction so those empty cgroups are avoided in the docker case. If we can hash it out quickly in one JIRA that's great, but I want to make sure the leak problem doesn't linger while we work through the architecture of cgroups and container runtimes. {quote} The main issue with doing this quick fix for the cgroups leak is that any cgroup parameters written by the various resource handlers will be ignored in the docker case because they will be written to the unused container cgroup. Internally, we added a cpu-shares option to docker to handle the cpu resource because that is the only one we're using, but for the community I think we need to address them all. Is it worth breaking cgroups parameters temporarily for docker to fix the leak? > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. 
For instance, if > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd. So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
[jira] [Reopened] (YARN-8640) Restore previous state in container-executor after failure
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan reopened YARN-8640: --- Reopening so I can provide patches for branch-2.7 and branch-2.8. > Restore previous state in container-executor after failure > -- > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2 > > Attachments: YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} had a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580234#comment-16580234 ] Jim Brennan commented on YARN-8648: --- {quote}I am wondering if this approach would break the docker launching docker use case. Currently if you're launching a new docker container from an existing docker container, you can have the new container use the same cgroup as the first container (e.g. /hadoop-yarn/${CONTAINER_ID}), but if there weren't a unique cgroup parent for the container you wouldn't be able to do that. Unless there's a way to find out the docker container id from inside the container? {quote} Thanks [~billie.rinaldi]. Yes I think this use-case would break as you suggest. {quote}One potential issue with a useResourceHandlers() approach is if the NM wants to manipulate cgroup settings on a live container. Having a runtime that says it doesn't use resource handlers implies that can't be done by that runtime, but it can be supported by the docker runtime {quote} Agreed [~jlowe]. I no longer think useResourceHandlers() is a good approach. I don't have a full solution in mind yet, but one question is whether we should continue using the per-container cgroup as the cgroup parent for docker. The main advantage to maintaining it is that there is a lot of code that already depends on it. All existing resource handlers just work with this setup. The disadvantage is that it makes fixing the leak harder because docker is creating hierarchies under the unused resource types (cpuset, hugetlb, etc...) and it creates them as root, making it harder for the NM to remove them. If we use the top-level (hadoop-yarn) as the cgroup parent, then docker cleans everything up pretty nicely (although it still leaks the top-level hadoop-yarn cgroup for the unused-by-NM resources). 
But it breaks the case [~billie.rinaldi] mentioned above, and requires that we convert all existing resource handlers to use docker command options in the docker case. One thought I had is adding a dockerCleanupResourceHandler that we tack on to the end of the resourceHandlerChain - its only job would be to clean up the extra container cgroups that docker creates. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd. So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. 
> I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
[jira] [Updated] (YARN-8640) Restore previous state in container-executor after failure
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8640: -- Attachment: YARN-8640-branch-2.8.001.patch YARN-8640-branch-2.7.001.patch > Restore previous state in container-executor after failure > -- > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2 > > Attachments: YARN-8640-branch-2.7.001.patch, > YARN-8640-branch-2.8.001.patch, YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} had a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
[ https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8656: -- Attachment: YARN-8656.002.patch > container-executor should not write cgroup tasks files for docker containers > > > Key: YARN-8656 > URL: https://issues.apache.org/jira/browse/YARN-8656 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8656.001.patch, YARN-8656.002.patch > > > If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker > run}} to ensure that all processes for the container are placed into a cgroup > under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. > Docker creates a cgroup there with the docker container id as the name and > all of the processes in the container go into that cgroup. > container-executor has code in {{launch_docker_container_as_user()}} that > then cherry-picks the PID of the docker container (usually the launch shell) > and writes that into the > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively > moving it from > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with > one process out of the container in the {{container_id}} cgroup, and the rest > in the {{container_id/docker_container_id}} cgroup. > Since we are passing the {{--cgroup-parent}} to docker, there is no need to > manually write the container pid to the tasks file - we can just remove the > code that does this in the docker case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8640) Restore previous state in container-executor after failure
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16583053#comment-16583053 ] Jim Brennan commented on YARN-8640: --- [~jlowe] I'm not sure what happened here with genericqa? > Restore previous state in container-executor after failure > -- > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2 > > Attachments: YARN-8640-branch-2.7.001.patch, > YARN-8640-branch-2.8.001.patch, YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} had a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578547#comment-16578547 ] Jim Brennan commented on YARN-8640: --- Tested by running test-container-executor and cetest - both pass with no errors. Also ran sleep and pi jobs on a single node cluster with and without docker, and also with NM restart during the jobs. [~jlowe] please review. > Restore previous state in container-executor if write_exit_code_file_as_nm > fails > > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} had a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
[ https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8656: -- Attachment: YARN-8656.001.patch > container-executor should not write cgroup tasks files for docker containers > > > Key: YARN-8656 > URL: https://issues.apache.org/jira/browse/YARN-8656 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8656.001.patch > > > If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker > run}} to ensure that all processes for the container are placed into a cgroup > under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. > Docker creates a cgroup there with the docker container id as the name and > all of the processes in the container go into that cgroup. > container-executor has code in {{launch_docker_container_as_user()}} that > then cherry-picks the PID of the docker container (usually the launch shell) > and writes that into the > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively > moving it from > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with > one process out of the container in the {{container_id}} cgroup, and the rest > in the {{container_id/docker_container_id}} cgroup. > Since we are passing the {{--cgroup-parent}} to docker, there is no need to > manually write the container pid to the tasks file - we can just remove the > code that does this in the docker case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
[ https://issues.apache.org/jira/browse/YARN-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579978#comment-16579978 ] Jim Brennan commented on YARN-8656: --- I have tested this by running test-container-executor, cetest, and nodemanager unit tests. I've also run some jobs on a single node cluster and manually verified that with Docker the single PID is no longer written to the {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file. All PIDs for the container appear in the {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id/tasks}} file, which is managed by docker. I do have one question: should I remove the resources_values argument from {{launch_docker_container_as_user()}}, since it is no longer used? Could also remove it from DockerLinuxContainerRuntime.buildLaunchOp(). [~jlowe], [~ebadger], [~eyang], [~shaneku...@gmail.com] - thoughts? > container-executor should not write cgroup tasks files for docker containers > > > Key: YARN-8656 > URL: https://issues.apache.org/jira/browse/YARN-8656 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8656.001.patch > > > If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker > run}} to ensure that all processes for the container are placed into a cgroup > under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. > Docker creates a cgroup there with the docker container id as the name and > all of the processes in the container go into that cgroup. 
> container-executor has code in {{launch_docker_container_as_user()}} that > then cherry-picks the PID of the docker container (usually the launch shell) > and writes that into the > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively > moving it from > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to > {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with > one process out of the container in the {{container_id}} cgroup, and the rest > in the {{container_id/docker_container_id}} cgroup. > Since we are passing the {{--cgroup-parent}} to docker, there is no need to > manually write the container pid to the tasks file - we can just remove the > code that does this in the docker case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586060#comment-16586060 ] Jim Brennan commented on YARN-8648: --- Thanks [~eyang]! My main concern about the minimal fix is the security aspect, since we will need to add an option to container-executor to tell it to delete all cgroups with a particular name as root (since docker will create them as root). I think this is mitigated if we use the "cgroup" section of container-executor.cfg to constrain it. This is currently used to enable updating params, but I think it could be used for this as well. It already defines the CGROUPS_ROOT (e.g., /sys/fs/cgroup), and the YARN_HIERARCHY (e.g., hadoop-yarn). We could either add another config parameter to define the list of hierarchies to clean up (e.g., cpuset, freezer, hugetlb, etc...), or we can parse /proc/mounts to determine the full list. I think it's safer to add the config parameter. I will start working on this version unless there are objections. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. 
> When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd. So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
[jira] [Updated] (YARN-8640) Restore previous state in container-executor after failure
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8640: -- Attachment: YARN-8640-branch-2.8.002.patch YARN-8640-branch-2.7.002.patch > Restore previous state in container-executor after failure > -- > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2 > > Attachments: YARN-8640-branch-2.7.001.patch, > YARN-8640-branch-2.7.002.patch, YARN-8640-branch-2.8.001.patch, > YARN-8640-branch-2.8.002.patch, YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} had a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6495) check docker container's exit code when writing to cgroup task files
[ https://issues.apache.org/jira/browse/YARN-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584038#comment-16584038 ] Jim Brennan commented on YARN-6495: --- YARN-8656 removed the code that this Jira was fixing. I think we can close this one now. [~Jaeboo], [~ebadger], any objections? > check docker container's exit code when writing to cgroup task files > > > Key: YARN-6495 > URL: https://issues.apache.org/jira/browse/YARN-6495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Jaeboo Jeong >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-6495.001.patch, YARN-6495.002.patch > > > If I execute simple command like date on docker container, the application > failed to complete successfully. > for example, > {code} > $ yarn jar > $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar > -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker -shell_command "date" -jar > $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar > -num_containers 1 -timeout 360 > … > 17/04/12 00:16:40 INFO distributedshell.Client: Application did finished > unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring > loop > 17/04/12 00:16:40 ERROR distributedshell.Client: Application failed to > complete successfully > {code} > The error log is like below. > {code} > ... > Failed to write pid to file > /cgroup_parent/cpu/hadoop-yarn/container_/tasks - No such process > ... > {code} > When writing pid to cgroup tasks, container-executor doesn’t check docker > container’s status. > If the container finished very quickly, we can’t write pid to cgroup tasks, > and it is not problem. > So container-executor needs to check docker container’s exit code during > writing pid to cgroup tasks. 
[jira] [Commented] (YARN-8640) Restore previous state in container-executor after failure
[ https://issues.apache.org/jira/browse/YARN-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584024#comment-16584024 ] Jim Brennan commented on YARN-8640: --- [~jlowe], thanks for the review! I have removed the changes to write_exit_code_file() in both patches. > Restore previous state in container-executor after failure > -- > > Key: YARN-8640 > URL: https://issues.apache.org/jira/browse/YARN-8640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2 > > Attachments: YARN-8640-branch-2.7.001.patch, > YARN-8640-branch-2.8.001.patch, YARN-8640.001.patch > > > The container-executor function {{write_exit_code_file_as_nm}} had a number > of failure conditions where it just returns -1 without restoring previous > state. > This is not a problem in any of the places where it is currently called, but > it could be a problem if future code changes call it before code that depends > on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8675) Setting hostname of docker container breaks with "host" networking mode for Apps which do not run as a YARN service
[ https://issues.apache.org/jira/browse/YARN-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592244#comment-16592244 ] Jim Brennan commented on YARN-8675: --- [~suma.shivaprasad] thanks for updating. Patch 3 looks good to me. > Setting hostname of docker container breaks with "host" networking mode for > Apps which do not run as a YARN service > --- > > Key: YARN-8675 > URL: https://issues.apache.org/jira/browse/YARN-8675 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Major > Labels: Docker > Attachments: YARN-8675.1.patch, YARN-8675.2.patch, YARN-8675.3.patch > > > Applications like the Spark AM currently do not run as a YARN service and > setting hostname breaks driver/executor communication if docker version > >=1.13.1 , especially with wire-encryption turned on. > YARN-8027 sets the hostname if YARN DNS is enabled. But the cluster could > have a mix of YARN service/native Applications. > The proposal is to not set the hostname when "host" networking mode is > enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8675) Setting hostname of docker container breaks with "host" networking mode for Apps which do not run as a YARN service
[ https://issues.apache.org/jira/browse/YARN-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592142#comment-16592142 ] Jim Brennan commented on YARN-8675: --- [~suma.shivaprasad] Thanks for working on this. I am still not clear on whether there exists any case in which we should be setting hostname when net=host? As coded, if the YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME environment variable is set we will use it for hostname even if net=host. Is this comment in DockerLinuxContainerRuntime.java still accurate? {noformat} * YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME} sets the * hostname to be used by the Docker container. If not specified, a * hostname will be derived from the container ID. This variable is * ignored if the network is 'host' and Registry DNS is not enabled. {noformat} > Setting hostname of docker container breaks with "host" networking mode for > Apps which do not run as a YARN service > --- > > Key: YARN-8675 > URL: https://issues.apache.org/jira/browse/YARN-8675 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Suma Shivaprasad >Priority: Major > Labels: Docker > Attachments: YARN-8675.1.patch, YARN-8675.2.patch > > > Applications like the Spark AM currently do not run as a YARN service and > setting hostname breaks driver/executor communication if docker version > >=1.13.1 , especially with wire-encryption turned on. > YARN-8027 sets the hostname if YARN DNS is enabled. But the cluster could > have a mix of YARN service/native Applications. > The proposal is to not set the hostname when "host" networking mode is > enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8648: -- Attachment: YARN-8648.001.patch > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop. 
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584479#comment-16584479 ] Jim Brennan commented on YARN-8648: --- I have been experimenting with the following incomplete approach: * CGroupsHandler ** Add missing controllers to the list of supported controllers ** Add initializeAllCGroupControllers() *** Initializes all of the cgroups controllers that were not already initialized by a ResourceHandler - this is mainly creating the hierarchy (hadoop-yarn) cgroup or verifying that it is there and writable. ** Add CreateCGroupAllControllers(containerId) *** Creates the containerId cgroup under all cgroup controllers ** Add DeleteCGroupAllControllers(containerId) *** Deletes the containerId cgroup under all cgroup controllers * ResourceHandlerModule ** Add wrappers to call the above methods. * LinuxContainerExecutor ** Add calls to above methods if the runtime is Docker (would probably be better to move these to the runtime) So far I have been testing with pre-mounted cgroup hierarchies. That is, I manually created the hadoop-yarn cgroup under each controller. I've run into several problems experimenting with this approach on RHEL 7: * The hadoop-yarn cgroup under the following controllers is being deleted by the system (when I let it sit idle for a while): blkio, devices, memory, pids ** I got around this for now by just not adding pids to the list and skipping the others in the new methods. We are not leaking cgroups for these controllers. * I am still leaking cgroups under /sys/fs/cgroup/systemd ** Even if I add "systemd" as one of the supported controllers, our mount-tab parsing code does not find it because it's not really a controller. * This feels pretty hacky - it might be better to just add a new dockerCGroupResourceHandler (as I mentioned above) to do effectively the same thing - we'd have to supply the list of controllers in a config property and deal with systemd. 
The way things are right now we would still have to add these to the list of supported controllers, because most of the interfaces are based on a controller enum. But even moving it to a separate ResourceHandler still seems hacky. * I haven't tested the mount-cgroup path yet, but I believe we would need to configure all of the controllers that we need to mount in container-executor.cfg. The main advantage to something along these lines is that it preserves the existing cgroups hierarchy, and no additional code is needed to deal with cgroup parameters. The other advantage is that we are pre-creating the hadoop-yarn cgroups with the correct owner/permissions - docker creates them as root. At this point, I'm not sure if I should proceed with this approach and I'm looking for opinions. The options I am considering are: # The approach I've been experimenting with, cleaned up # The minimal, just-fix-the-leak approach, which would be to add a cleanupCGroups() method to the runtime. ** We call it after calling the ResourceHandlers.postComplete() in LCE. ** Docker would be the only runtime that implements it. ** We'd need to add a container-executor function to handle it. ** It could search for the containerId cgroup under all mounted cgroups and delete any that it finds *** Would not delete any that still have processes *** Security concerns? # The let-docker-be-docker approach ** This is the change-the-cgroup-parent approach. Instead of passing /hadoop-yarn/containerId, we would just use /hadoop-yarn and let docker create its dockerContainerId cgroups under there. ** Solves the leak by just letting docker handle it - no intermediate containerId cgroups are created, so they don't need to be deleted by NM. ** To do this, I think we'd need to change every Cgroups ResourceHandler to do something different for Docker. The main ones are for blkio and cpu. *** Don't create the containerId cgroups *** Don't modify cgroup params directly. 
*** Return the /hadoop-yarn/tasks path for the ADD_PID_TO_CGROUP operation so we set the cgroup parent correctly. *** Would likely need to add new PrivilegedOps for each cgroup parameter to pass them through (these are returned by ResourceHandler.preStart()). *** Add code to add each new cgroup parameter to docker run. *** Would need to support updating params via docker update command to support the ResourceHandler.updateContainer() method. *** [~billie.rinaldi], I've thought a bit more about the docker in docker case, which we thought would be a problem with this approach. I think it is solvable though - you can obtain the name of the docker cgroup from /proc/self/cgroup. I don't know if this is workable for your use-case though? Comments? Concerns? Alternatives? cc:[~jlowe], [~ebadger], [~shaneku...@gmail.com], [~billie.rinaldi], [~eyang] > Container cgroups are leaked when using docker >
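For the "parse /proc/mounts to determine the full list" variant weighed above, the enumeration is a scan of the mounts table for entries whose filesystem type is {{cgroup}} (which is also why systemd is missed - its mount type is not "cgroup" in the same sense). A rough sketch, with an illustrative helper name and fixed-size buffers that are not taken from the actual patch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: collect mount points whose fs type is "cgroup" from
 * a file in /proc/mounts format:
 *   device mountpoint fstype options dump pass
 * Returns the number of cgroup mounts found, or -1 on open failure. */
int find_cgroup_mounts(const char *mounts_file, char mounts[][256], int max) {
    FILE *f = fopen(mounts_file, "r");
    if (f == NULL) {
        return -1;
    }
    char dev[256], mnt[256], type[64];
    int n = 0;
    while (n < max &&
           fscanf(f, "%255s %255s %63s %*s %*d %*d", dev, mnt, type) == 3) {
        if (strcmp(type, "cgroup") == 0) {
            snprintf(mounts[n], 256, "%s", mnt);
            n++;
        }
    }
    fclose(f);
    return n;
}
```

A cleanup pass would then look for the containerId directory under each returned mount point; an explicit config parameter, as preferred in the earlier comment, trades this discovery for an admin-controlled allow list.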
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596345#comment-16596345 ] Jim Brennan commented on YARN-8648: --- Looks like this is ready for review. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop. 
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599213#comment-16599213 ] Jim Brennan commented on YARN-8648: --- [~jlowe] thanks for the review! {quote}Why was the postComplete call moved in reapContainer to before the container is removed via docker? Shouldn't docker first remove its cgroups for the container before we remove ours? {quote} I was trying to preserve the order of operations. Normally postComplete is called immediately after the launchContainer() returns, and then reapContainer() is called as part of cleanupContainer() processing. So the resource handlers usually get a chance to clean up cgroups before we cleanup the container. If we do the docker cleanup first, it will delete the cgroups before the resource handler postComplete processing - it doesn't know which ones are handled by resource handlers, so it just deletes them all. Since they both are really just deleting the cgroups, I don't think the order matters that much, so I will move it back if you think that is better. {quote}Is there a reason to separate removing docker cgroups from removing the docker container? This seems like a natural extension to cleaning up after a container run by docker, and that's already covered by the reap command. The patch would remain a docker-only change but without needing to modify the container-executor interface. {quote} It is currently being done as part of the reap processing, but as a separate privileged operation. We definitely could just add this processing to the remove-docker-container processing in container-executor, but it would require adding the yarn-hierarchy as an additional argument for the DockerRmCommand. This would also require changing the DockerContainerDeletionTask() to store the yarn-hierarchy String along with the ContainerId. 
Despite the additional container-executor interface, I think the current approach is less code/simpler, but I'm definitely willing to rework it if you think it is a better solution. {quote}Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased. Similar for CGROUP_MOUNT. {quote} I will fix these. {quote}The snprintf result should be checked for truncation in addition to output errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we formulate an incomplete path targeted for deletion if that somehow occurs. Alternatively the code could use make_string or asprintf to allocate an appropriately sized buffer for each entry rather than trying to reuse a manually sized buffer. {quote} I will fix this. I forgot about make_string(). {quote}Is there any point in logging to the error file that a path we want to delete has already been deleted? This seems like it will just be noise, especially if systemd or something else is periodically cleaning some of these empty cgroups. {quote} I'll remove it - was nice while debugging, but not needed. {quote}Related to the previous comment, the rmdir result should be checked for ENOENT and treat that as success. {quote} I explicitly check that the directory exists before calling rmdir, so I'm not sure this is necessary, but I can add it anyway. {quote}Nit: I think lineptr should be freed in the cleanup label in case someone later adds a fatal error that jumps to cleanup. {quote} Will do. Thanks again for the review! > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. 
For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates >
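The truncation point from the review - prefer an allocating formatter over a fixed PATH_MAX buffer - can be sketched as follows. This uses glibc's asprintf as a stand-in for container-executor's own make_string helper; the function name is illustrative, not the patch's code:

```c
#define _GNU_SOURCE   /* for asprintf */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<mount>/<hierarchy>/<container_id>" with a buffer allocated to
 * fit, so no truncation check is needed (unlike snprintf into a fixed
 * PATH_MAX array, where result >= PATH_MAX must be treated as an error).
 * Caller frees the result. */
char *build_cgroup_path(const char *mount, const char *hierarchy,
                        const char *container_id) {
    char *path = NULL;
    if (asprintf(&path, "%s/%s/%s", mount, hierarchy, container_id) < 0) {
        return NULL;  /* allocation failure */
    }
    return path;
}
```

The design point: with snprintf, a silently truncated path could name the wrong directory for deletion; an allocated buffer makes that failure mode impossible.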
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599235#comment-16599235 ] Jim Brennan commented on YARN-8648: --- {quote} I explicitly check that the directory exists before calling rmdir, so I'm not sure this is necessary, but I can add it anyway. {quote} There is a small window where it could be removed between the exist check and the rmdir, so it is necessary. I'm tempted to just remove the dir_exists() check. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. 
Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
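The race discussed above - another cleaner (docker, systemd) may remove the cgroup between the existence check and the rmdir - is why ENOENT must be treated as success. A minimal sketch (hypothetical helper name, not the patch's function):

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* ENOENT-tolerant delete: "already gone" counts as success, closing the
 * window between a dir_exists() check and the rmdir.  A non-empty cgroup
 * (one that still has processes) still fails with EBUSY/ENOTEMPTY, which
 * is the desired behavior - we never delete cgroups with live tasks. */
int rmdir_if_present(const char *path) {
    if (rmdir(path) != 0 && errno != ENOENT) {
        return -1;
    }
    return 0;
}
```

With this in place, the separate dir_exists() check becomes redundant, matching the "tempted to just remove the dir_exists() check" observation.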
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603606#comment-16603606 ] Jim Brennan commented on YARN-8648: --- Put up another patch to fix the checkstyle issue. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch, YARN-8648.002.patch, > YARN-8648.003.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. 
> I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop.
[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8648: -- Attachment: YARN-8648.003.patch > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch, YARN-8648.002.patch, > YARN-8648.003.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop. 
[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8648: -- Attachment: YARN-8648.004.patch > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch, YARN-8648.002.patch, > YARN-8648.003.patch, YARN-8648.004.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop. 
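The cleanup the description calls for can be sketched as follows. This is an illustrative sketch only, not the patch itself: the controller list is the one observed on the rhel7 system above, and the function name is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Controllers under which docker replicates the --cgroup-parent hierarchy
 * (the set observed on the rhel7 system described above). */
static const char *CONTROLLERS[] = {
    "blkio", "cpuset", "devices", "freezer", "hugetlb", "memory",
    "net_cls", "net_prio", "perf_event", "systemd", NULL
};

/* Remove the leaked per-container cgroup directory under each controller.
 * In production cgroup_root would be /sys/fs/cgroup and hierarchy the
 * yarn-hierarchy (e.g. "hadoop-yarn"). rmdir() only removes empty
 * directories - exactly what a leaked childless cgroup is - so this
 * cannot accidentally destroy a live container's cgroup. */
int cleanup_container_cgroups(const char *cgroup_root, const char *hierarchy,
                              const char *container_id) {
    char path[4096];
    int failures = 0;
    for (int i = 0; CONTROLLERS[i] != NULL; i++) {
        snprintf(path, sizeof(path), "%s/%s/%s/%s",
                 cgroup_root, CONTROLLERS[i], hierarchy, container_id);
        if (rmdir(path) != 0 && errno != ENOENT) {
            failures++;  /* non-empty or permission problem; report it */
        }
    }
    return failures;  /* 0 when every leaked cgroup is gone or absent */
}
```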
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608212#comment-16608212 ] Jim Brennan commented on YARN-8648:
---
I have uploaded a patch that adds the cgroup cleanup to the DockerRmCommand. It also includes some fixes for exec_docker_command() in container-executor.c:
* No longer passes optind, which was eclipsing the global variable of the same name
* Fixes a stack overrun - the argument array was allocated using sizeof(char) instead of sizeof(char *)
* Uses optind for indexing into argv instead of assuming the args start at 2

It also adds a wrapper function, remove_docker_container(), that forks and calls exec_docker_command() in the child, so the cgroup cleanup can run after it finishes.
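The second fix above - allocating with sizeof(char) instead of sizeof(char *) - is a classic C overrun worth spelling out. A minimal sketch of the corrected pattern (the function name and shape here are illustrative, not the actual container-executor code):

```c
#include <assert.h>
#include <stdlib.h>

/* Build a NULL-terminated argv for execvp(). The bug being fixed: with
 * malloc((argc + 1) * sizeof(char)) the buffer holds argc+1 *bytes*, but
 * the loop below stores argc+1 *pointers* (8 bytes each on 64-bit),
 * overrunning the allocation. sizeof(char *) reserves the right amount. */
char **build_exec_args(char *const args[], int argc) {
    char **exec_args = malloc((argc + 1) * sizeof(char *));
    if (exec_args == NULL) {
        return NULL;
    }
    for (int i = 0; i < argc; i++) {
        exec_args[i] = args[i];
    }
    exec_args[argc] = NULL;  /* execvp() requires a NULL terminator */
    return exec_args;
}
```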
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603202#comment-16603202 ] Jim Brennan commented on YARN-8648:
---
I've uploaded a patch that addresses most of the issues raised by [~jlowe], except for moving the functionality to the Docker RM command - I wanted to put up these other changes before reworking that part.
I misspoke in my earlier comment - I don't think any change is needed to DockerContainerDeletionService, because it ends up calling LinuxContainerExecutor.removeDockerContainer(), which can look up the yarn-hierarchy.
My only reservation about moving this to the DockerRmCommand is that most (if not all) arguments to the Docker*Commands are actual command-line arguments for the docker command; this would be an exception to that. I'm not sure how much that matters, because I agree this cleanup naturally aligns with removing the container.
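The fork-and-wait wrapper discussed in the comments above can be sketched like this. It is a hedged approximation of remove_docker_container(): the real patch execs the docker binary configured in container-executor.cfg; here the binary path is a parameter so the sketch stays self-contained.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and exec the docker command in the child; the parent waits, and can
 * then perform the cgroup cleanup after the container has been removed.
 * Returns the child's exit code, or -1 on fork/wait failure. */
int remove_docker_container(const char *docker_binary, char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execv(docker_binary, argv);  /* e.g. "docker rm <container_id>" */
        _exit(127);                  /* exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    /* The parent continues here: this is where the per-controller cgroup
     * cleanup from the patch would run. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```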
[jira] [Updated] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8648:
--
Attachment: YARN-8648.002.patch
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609206#comment-16609206 ] Jim Brennan commented on YARN-8648:
---
This is ready for review.
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539286#comment-16539286 ] Jim Brennan commented on YARN-8515:
---
Here is an example case that we saw.
Docker ps info for this container:
{noformat}
968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days ago container_e07_1528760012992_2875921_01_69
{noformat}
NM log, with some added info from the Docker container and journalctl to show where the docker container started/exited:
{noformat}
2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO containermanager.ContainerManagerImpl: Start request for container_e07_1528760012992_2875921_01_69 by user p_condor
2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_69 to application application_1528760012992_2875921
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from NEW to LOCALIZING
2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Initializing container container_e07_1528760012992_2875921_01_69
2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO localizer.ResourceLocalizationService: Created localizer for container_e07_1528760012992_2875921_01_69
2018-06-27 16:32:48,786 [LocalizerRunner for container_e07_1528760012992_2875921_01_69] INFO localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_69.tokens. Credentials list:
2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to LOCALIZED
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to RUNNING
2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_69
2018-06-27 16:32:53.345 Docker container started
2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_69: 132.5 MB of 3 GB physical memory used; 4.3 GB of 6.3 GB virtual memory used
2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG: /
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = mapred
STARTUP_MSG: host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.3.2.1806111934
2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl: Recovering container_e07_1528760012992_2875921_01_69 in state LAUNCHED with exit code -1000
2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_69 to application application_1528760012992_2875921
2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers:
2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers:
2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers:
2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from NEW to LOCALIZING
2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Initializing container container_e07_1528760012992_2875921_01_69
2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to LOCALIZED
2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO nodemanager.ContainerExecutor: Reacquiring container_e07_1528760012992_2875921_01_69 with pid 103072
2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to RUNNING
2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_69
2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending out 598 NM container statuses:
2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers
2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id
{noformat}
[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
Jim Brennan created YARN-8515:
---
Summary: container-executor can crash with SIGPIPE after nodemanager restart
Key: YARN-8515
URL: https://issues.apache.org/jira/browse/YARN-8515
Project: Hadoop YARN
Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan

When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is:
# Nodemanager restarts
# Containers are recovered and then run for a while
# Containers are killed for some (legitimate) reason
# Container-executor exits without removing the docker container

After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE.
What is happening is that the shell command executor used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed. Then, when the container-executor continues executing after the container exits with an error, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this crashes the container-executor before it can actually remove the docker container.
We ran into this in branch-2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash.
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539290#comment-16539290 ] Jim Brennan commented on YARN-8515:
---
I have been able to repro this reliably on a test cluster. Repro steps are:
# Start a sleep job with a lot of mappers sleeping for 50 seconds
# On one worker node, kill the NM after a set of containers starts
# Restart the NM
# On the gateway, kill the application (before the current containers finish)

This leaves the containers on the node where the nodemanager was restarted in the exited state - container-executor is not cleaning up the docker containers. Here is an strace of one of the container-executors when the application is killed:
{noformat}
-bash-4.2$ sudo strace -s 4096 -f -p 7176
strace: Process 7176 attached
read(3, "143\n", 4096) = 4
close(3) = 0
wait4(7566, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 7566
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7566, si_uid=0, si_status=0, si_utime=1, si_stime=0} ---
munmap(0x7f233bfa4000, 4096) = 0
write(2, "Docker container exit code was not zero: 143\n", 45) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=7176, si_uid=0} ---
+++ killed by SIGPIPE +++
{noformat}
The problem is that when container-executor is started by the NM using the privileged operation executor, it attaches stream readers to stdout and stderr. When we restart the NM, these threads are killed. Then, when the application is killed, it kills the running containers, and container-executor returns from waiting for the docker container. When it tries to write an error message to stderr, it generates a SIGPIPE signal, because the other end of the pipe has been killed. Since we are not handling that signal, container-executor crashes and we never remove the docker container.
I have verified that if I change container-executor to ignore SIGPIPE, the problem does not occur.
[jira] [Created] (YARN-8518) test-container-executor test_is_empty() is broken
Jim Brennan created YARN-8518:
---
Summary: test-container-executor test_is_empty() is broken
Key: YARN-8518
URL: https://issues.apache.org/jira/browse/YARN-8518
Project: Hadoop YARN
Issue Type: Bug
Reporter: Jim Brennan

A new test was recently added to test-container-executor.c that has some problems. It attempts to mkdir() a hard-coded path, /tmp/2938rf2983hcqnw8ud/emptydir, which fails because the base directory is not there. These directories are not being cleaned up either. It should be using TEST_ROOT.
I don't know which Jira this change was made under - the git commit from July 9 2018 does not reference a Jira.
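A hedged sketch of the fix the report suggests - building the scratch directory under TEST_ROOT and creating the parent first, instead of mkdir()ing a hard-coded path whose parent doesn't exist. TEST_ROOT's value and the helper's name here are assumptions, not the actual test code:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define TEST_ROOT "/tmp/test-container-executor"  /* assumed value; the real test defines its own */

/* Create TEST_ROOT/emptydir, making the parent first so the second mkdir()
 * cannot fail with ENOENT the way the hard-coded path did. Writes the
 * resulting path into out_path so the caller can clean it up afterwards. */
int make_empty_dir(char *out_path, size_t len) {
    if (mkdir(TEST_ROOT, 0755) != 0 && errno != EEXIST) {
        return -1;  /* parent must exist before the child mkdir() */
    }
    snprintf(out_path, len, "%s/emptydir", TEST_ROOT);
    if (mkdir(out_path, 0755) != 0 && errno != EEXIST) {
        return -1;
    }
    return 0;
}
```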
[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken
[ https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540597#comment-16540597 ] Jim Brennan commented on YARN-8518:
---
[~rkanter], [~szegedim], let me know if you would like me to put up a patch for this.
[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken
[ https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542171#comment-16542171 ] Jim Brennan commented on YARN-8518:
---
[~rkanter], can you please review this fix?
[jira] [Assigned] (YARN-8518) test-container-executor test_is_empty() is broken
[ https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan reassigned YARN-8518:
---
Assignee: Jim Brennan
[jira] [Updated] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8515:
--
Attachment: YARN-8515.001.patch
[jira] [Updated] (YARN-8518) test-container-executor test_is_empty() is broken
[ https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8518:
--
Attachment: YARN-8518.001.patch
[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken
[ https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541894#comment-16541894 ] Jim Brennan commented on YARN-8518:
---
The unit test failure is not related to this change, and it looks like there is already a Jira for it: YARN-5857. I think this is ready for review.
[jira] [Commented] (YARN-8518) test-container-executor test_is_empty() is broken
[ https://issues.apache.org/jira/browse/YARN-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541878#comment-16541878 ] Jim Brennan commented on YARN-8518:
---
I can confirm that it is running this test for pre-commit builds - I just hit this failure on YARN-8515.
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541868#comment-16541868 ] Jim Brennan commented on YARN-8515: --- The unit test failure is YARN-8518. Might want to wait for that one to go through before we continue with this one, just to see that test-container-executor succeeds. I tested this manually, running several test jobs and restarting the NM while jobs were running. Because trunk has [~shaneku...@gmail.com]'s docker life-cycle changes, I don't see the same failure I saw on branch 2.8, but the patch does not introduce any new problems that I can see. > container-executor can crash with SIGPIPE after nodemanager restart > --- > > Key: YARN-8515 > URL: https://issues.apache.org/jira/browse/YARN-8515 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8515.001.patch > > > When running with docker on large clusters, we have noticed that sometimes > docker containers are not removed - they remain in the exited state, and the > corresponding container-executor is no longer running. Upon investigation, > we noticed that this always seemed to happen after a nodemanager restart. > The sequence leading to the stranded docker containers is: > # Nodemanager restarts > # Containers are recovered and then run for a while > # Containers are killed for some (legitimate) reason > # Container-executor exits without removing the docker container. > After reproducing this on a test cluster, we found that the > container-executor was exiting due to a SIGPIPE. > What is happening is that the shell command executor that is used to start > container-executor has threads reading from c-e's stdout and stderr. When > the NM is restarted, these threads are killed. Then when the > container-executor continues executing after the container exits with error, > it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. 
Since SIGPIPE is > not handled, this crashes the container-executor before it can actually > remove the docker container. > We ran into this in branch 2.8. The way docker containers are removed has > been completely redesigned in trunk, so I don't think it will lead to this > exact failure, but after an NM restart, potentially any write to stderr or > stdout in the container-executor could cause it to crash. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
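The failure mode described above — a parent that abandons its read end of a child's stderr pipe, after which the child's next write to stderr delivers SIGPIPE — can be reproduced outside Hadoop with a few lines of Java. This is a minimal illustrative sketch, not Hadoop's shell command executor or the container-executor itself; the class name and the child command are made up for the demo:

```java
import java.io.IOException;

/**
 * Minimal sketch (not Hadoop code) of the SIGPIPE failure mode described
 * above: the parent closes the read end of the child's stderr pipe (analogous
 * to the NM's shell-executor reader threads dying on restart), and the
 * child's next write to stderr delivers SIGPIPE, killing it before it can
 * run any of its remaining cleanup steps.
 */
public class SigpipeDemo {
  static int runChildWithClosedStderrReader()
      throws IOException, InterruptedException {
    // The child sleeps, then writes to stderr; by the time it writes, the
    // parent has already closed the read end of the stderr pipe.
    Process child = new ProcessBuilder(
        "sh", "-c", "sleep 1; echo oops >&2; echo survived").start();
    child.getErrorStream().close();  // parent abandons the stderr pipe
    // On Linux this is typically 141 (128 + SIGPIPE); "echo survived"
    // never runs, just as container cleanup never runs in the real bug.
    return child.waitFor();
  }

  public static void main(String[] args) throws Exception {
    System.out.println("child exit code: " + runChildWithClosedStderrReader());
  }
}
```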
[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381033#comment-16381033 ] Jim Brennan commented on YARN-7677: --- Uploaded another patch that fixes the extra import reported by checkstyle. As noted for previous patches, I am not going to fix the too many arguments checkstyle issues, as adding an argument to writeLaunchEnv and sanitizeEnv is appropriate for this change. The unit test failure for TestContainerSchedulerQueuing is a separate issue: [YARN-7700] > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-7677: -- Attachment: YARN-7677.007.patch > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7677) Docker image cannot set HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/YARN-7677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381129#comment-16381129 ] Jim Brennan commented on YARN-7677: --- Check-style issues are expected, as noted above. Unit test failure is tracked by YARN-7700 [~jlowe], this is ready for review. > Docker image cannot set HADOOP_CONF_DIR > --- > > Key: YARN-7677 > URL: https://issues.apache.org/jira/browse/YARN-7677 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Eric Badger >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-7677.001.patch, YARN-7677.002.patch, > YARN-7677.003.patch, YARN-7677.004.patch, YARN-7677.005.patch, > YARN-7677.006.patch, YARN-7677.007.patch > > > Currently, {{HADOOP_CONF_DIR}} is being put into the task environment whether > it's set by the user or not. It completely bypasses the whitelist and so > there is no way for a task to not have {{HADOOP_CONF_DIR}} set. This causes > problems in the Docker use case where Docker containers will set up their own > environment and have their own {{HADOOP_CONF_DIR}} preset in the image > itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
Jim Brennan created YARN-8027: - Summary: Setting hostname of docker container breaks for --net=host in docker 1.13 Key: YARN-8027 URL: https://issues.apache.org/jira/browse/YARN-8027 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.0.0 Reporter: Jim Brennan Assignee: Jim Brennan In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname argument to the docker run command to set the hostname in the container to something like: ctr-e84-1520889172376-0001-01-01. This does not work when combined with the --net=host command line option in Docker 1.13.1. It causes multiple failures when the client tries to resolve the hostname and it fails. We haven't seen this before because we were using docker 1.12.6 which seems to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396115#comment-16396115 ] Jim Brennan commented on YARN-8027: --- This code was added by [YARN-6804]. [~billie.rinaldi], [~jianh], I don't think we should be setting --hostname when --net=host. Do you agree? > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397007#comment-16397007 ] Jim Brennan commented on YARN-8027: --- {quote}We should look into whether it is a bug in that version of Docker. I see a couple of tickets regarding adding support for setting hostname when net=host, which would indicate that is a valid setting. I have not dug far enough to determine which versions are supposed to support it. {quote} [~billie.rinaldi], I think it is actually the opposite. Specifying --hostname with --net=host was broken before docker 1.13.1, which is why it didn't cause us a problem. In 1.13.1 though, it works, which breaks our ability to resolve the hostname, since we are not using Registry DNS. I agree with [~jlowe] and [~shaneku...@gmail.com], we should only set the hostname when Registry DNS is enabled, as long as this is indeed always the case. We haven't experimented with user-defined networks here - is it the case that Registry DNS must always be used for user-defined networks? > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. 

[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8027: -- Attachment: YARN-8027.001.patch > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
[ https://issues.apache.org/jira/browse/YARN-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399006#comment-16399006 ] Jim Brennan commented on YARN-8029: --- There was a related discussion in [HADOOP-11640] about allowing the user to specify an alternate delimiter or adding an escaping mechanism. I think in this case, the better solution would be to change the docker runtime environment variables to use a different separator - semicolon, or pipe, or something else. > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators > > > Key: YARN-8029 > URL: https://issues.apache.org/jira/browse/YARN-8029 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Priority: Major > > The following docker-related environment variables specify a comma-separated > list of mounts: > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS > This is a problem because hadoop -Dmapreduce.map.env and related options use > comma as a delimiter. So if I put more than one mount in > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS the comma in the variable will be > treated as a delimiter for the hadoop command line option and all but the > first mount will be ignored. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
Jim Brennan created YARN-8029: - Summary: YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators Key: YARN-8029 URL: https://issues.apache.org/jira/browse/YARN-8029 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.0.0 Reporter: Jim Brennan The following docker-related environment variables specify a comma-separated list of mounts: YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS This is a problem because hadoop -Dmapreduce.map.env and related options use comma as a delimiter. So if I put more than one mount in YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS the comma in the variable will be treated as a delimiter for the hadoop command line option and all but the first mount will be ignored. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398752#comment-16398752 ] Jim Brennan commented on YARN-8027: --- My thinking was that the only known case where there is a problem is with --net=host, so I was keeping the change narrowed to that case. With network set to bridge or none, the default hostname for the container is the container id, and it is not resolvable inside the container, so changing it to a more useful name seems relatively harmless. For user defined networks, I'm unsure if there is a case where we would want to set the container name without using Registry DNS. I'm happy to simplify this to just check Registry DNS if [~shaneku...@gmail.com] and [~billie.rinaldi] agree that is the best solution. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398635#comment-16398635 ] Jim Brennan commented on YARN-8027: --- The unit test failure (testKillOpportunisticForGuaranteedContainer) does not appear to be related to my changes. [~jlowe], [~shaneku...@gmail.com], [~billie.rinaldi], I believe this is ready for review. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399355#comment-16399355 ] Jim Brennan commented on YARN-8027: --- [~suma.shivaprasad], thanks for your comment. It sounds like the current patch would be ok with you then? It preserves the current behavior except in the case where network is host and Registry DNS is not enabled. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
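The rule the patch is described as implementing — preserve current behavior except when the network is host and Registry DNS is not enabled — fits in a single predicate. The method name and signature below are illustrative only, not the actual DockerLinuxContainerRuntime code:

```java
/**
 * Illustrative predicate only -- not the actual DockerLinuxContainerRuntime
 * code. Captures the rule described above: keep setting --hostname in every
 * case except host networking with Registry DNS disabled, where docker 1.13+
 * honors --hostname and the resulting name would be unresolvable.
 */
public class HostnamePolicy {
  static boolean shouldSetHostname(String network, boolean registryDnsEnabled) {
    return !("host".equals(network) && !registryDnsEnabled);
  }
}
```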
[jira] [Commented] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
[ https://issues.apache.org/jira/browse/YARN-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399240#comment-16399240 ] Jim Brennan commented on YARN-8029: --- Thanks [~shaneku...@gmail.com]. This does appear to be a duplicate of YARN-6830, with respect to the underlying problem, but proposes a different solution. Supporting the ability to quote the values does seem like a natural approach - it's the first thing I tried to do. I proposed changing the delimiters in these docker runtime variables because it is a safe change - it can't break anything because it's currently not working with commas. While commas seems like a natural choice for the delimiter, I don't think changing it to something else would be much of a hardship as long as it is documented. I'm willing to work on either this or YARN-6830, depending on which option is favored. cc: [~jlowe], [~templedf], [~aw], > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators > > > Key: YARN-8029 > URL: https://issues.apache.org/jira/browse/YARN-8029 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Priority: Major > > The following docker-related environment variables specify a comma-separated > list of mounts: > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS > This is a problem because hadoop -Dmapreduce.map.env and related options use > comma as a delimiter. So if I put more than one mount in > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS the comma in the variable will be > treated as a delimiter for the hadoop command line option and all but the > first mount will be ignored. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8064) Docker ".cmd" files should not be put in hadoop.tmp.dir
[ https://issues.apache.org/jira/browse/YARN-8064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428841#comment-16428841 ] Jim Brennan commented on YARN-8064: --- [~ebadger], one question - why are we retaining the old version of writeCommandToTempFile(), which is still being used by executeDockerCommand()? Might be good to have comments that describe under which conditions each version should be used. > Docker ".cmd" files should not be put in hadoop.tmp.dir > --- > > Key: YARN-8064 > URL: https://issues.apache.org/jira/browse/YARN-8064 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-8064.001.patch, YARN-8064.002.patch, > YARN-8064.003.patch, YARN-8064.004.patch, YARN-8064.005.patch > > > Currently all of the docker command files are being put into > {{hadoop.tmp.dir}}, which doesn't get cleaned up. So, eventually all of the > inodes will fill up and no more tasks will be able to run -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6830) Support quoted strings for environment variables
[ https://issues.apache.org/jira/browse/YARN-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425644#comment-16425644 ] Jim Brennan commented on YARN-6830: --- Solution proposed by [~aw] for mapreduce variables is being addressed in MAPREDUCE-7069. > Support quoted strings for environment variables > > > Key: YARN-6830 > URL: https://issues.apache.org/jira/browse/YARN-6830 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Shane Kumpf >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-6830.001.patch, YARN-6830.002.patch, > YARN-6830.003.patch, YARN-6830.004.patch > > > There are cases where it is necessary to allow for quoted string literals > within environment variables values when passed via the yarn command line > interface. > For example, consider the follow environment variables for a MR map task. > {{MODE=bar}} > {{IMAGE_NAME=foo}} > {{MOUNTS=/tmp/foo,/tmp/bar}} > When running the MR job, these environment variables are supplied as a comma > delimited string. > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > In this case, {{MOUNTS}} will be parsed and added to the task environment as > {{MOUNTS=/tmp/foo}}. Any attempts to quote the embedded comma separated value > results in quote characters becoming part of the value, and parsing still > breaks down at the comma. > This issue is to allow for quoting the comma separated value (escaped double > or single quote). This was mentioned on YARN-4595 and will impact YARN-5534 > as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
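One way to support the quoted values this issue asks for is a split that treats commas inside quotes as part of the value. This is an illustrative sketch of the idea, not the parser from any of the attached patches:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative quote-aware splitter -- not the code from the attached patches. */
public class QuotedEnvSplitter {
  /**
   * Splits a comma-delimited KEY=VALUE list, treating commas inside single
   * or double quotes as part of the value, and stripping the quotes from
   * the parsed value.
   */
  static List<String> split(String env) {
    List<String> pairs = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    char quote = 0;                      // currently open quote char, or 0
    for (char c : env.toCharArray()) {
      if (quote != 0) {
        if (c == quote) {
          quote = 0;                     // closing quote: drop it
        } else {
          cur.append(c);                 // quoted char, commas included
        }
      } else if (c == '"' || c == '\'') {
        quote = c;                       // opening quote: drop it
      } else if (c == ',') {
        pairs.add(cur.toString());       // unquoted comma: pair boundary
        cur.setLength(0);
      } else {
        cur.append(c);
      }
    }
    pairs.add(cur.toString());
    return pairs;
  }
}
```

With the example from the issue description, MOUNTS keeps both paths instead of being truncated at the embedded comma.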
[jira] [Reopened] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan reopened YARN-8027: --- Reopening so I can put up a patch for branch 3. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8027: -- Attachment: YARN-8027-branch-3.001.patch > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.001.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8027: -- Attachment: YARN-8027-branch-3.0.001.patch > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, > YARN-8027-branch-3.001.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8027: -- Attachment: (was: YARN-8027-branch-3.001.patch) > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435712#comment-16435712 ] Jim Brennan commented on YARN-8027: --- Renamed branch-3 patch for branch-3.0. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6434) When setting environment variables, can't use comma for a list of value in key = value pairs.
[ https://issues.apache.org/jira/browse/YARN-6434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436088#comment-16436088 ] Jim Brennan commented on YARN-6434: --- [~Jaeboo], please close this if you agree that it is resolved by [MAPREDUCE-7069]. > When setting environment variables, can't use comma for a list of value in > key = value pairs. > - > > Key: YARN-6434 > URL: https://issues.apache.org/jira/browse/YARN-6434 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jaeboo Jeong >Priority: Major > Attachments: YARN-6434-trunk.001.patch, YARN-6434.001.patch > > > We can set environment variables using yarn.app.mapreduce.am.env, > mapreduce.map.env, mapreduce.reduce.env. > There is no problem if we use key=value pairs like X=Y, X=$Y. > However If we want to set key=a list of value pair(e.g. X=Y,Z), we can’t. > This is related to YARN-4595. > The attached patch is based on YARN-3768. > We can set environment variables like below. > {code} > mapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS=\"/dir1:/targetdir1,/dir2:/targetdir2\"" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435891#comment-16435891 ] Jim Brennan commented on YARN-8027: --- Missed fixing that test that randomly picks which network to use. It will fail when the network happens to be 'host'. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435906#comment-16435906 ] Jim Brennan commented on YARN-8027: --- Submitted new branch-3.0 patch that fixes the broken TestDockerContainerRuntime test. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, > YARN-8027-branch-3.0.002.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8027: -- Attachment: YARN-8027-branch-3.0.002.patch > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, > YARN-8027-branch-3.0.002.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6434) When setting environment variables, can't use comma for a list of value in key = value pairs.
[ https://issues.apache.org/jira/browse/YARN-6434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436084#comment-16436084 ] Jim Brennan commented on YARN-6434: --- This issue was resolved in a different way in MAPREDUCE-7069. You can now specify variables that have commas in them individually, e.g., {{mapreduce.map.env.VARNAME=value}}. > When setting environment variables, can't use comma for a list of value in > key = value pairs. > - > > Key: YARN-6434 > URL: https://issues.apache.org/jira/browse/YARN-6434 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jaeboo Jeong >Priority: Major > Attachments: YARN-6434-trunk.001.patch, YARN-6434.001.patch > > > We can set environment variables using yarn.app.mapreduce.am.env, > mapreduce.map.env, mapreduce.reduce.env. > There is no problem if we use key=value pairs like X=Y, X=$Y. > However, if we want to set a key to a list of values (e.g. X=Y,Z), we can’t. > This is related to YARN-4595. > The attached patch is based on YARN-3768. > We can set environment variables like below. > {code} > mapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS=\"/dir1:/targetdir1,/dir2:/targetdir2\"" > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
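A rough Python model of the per-variable syntax that MAPREDUCE-7069 introduced (the real implementation is Java inside Hadoop; the helper name here is illustrative): each variable gets its own property, so its value is taken verbatim and may contain commas.

```python
# Illustrative sketch, not the Hadoop implementation: collect VARNAME=value
# pairs from properties named <prefix>.VARNAME, taking each value verbatim.

def env_from_props(props, prefix):
    """Build an environment map from per-variable properties."""
    prefix = prefix + "."
    return {name[len(prefix):]: value
            for name, value in props.items()
            if name.startswith(prefix) and len(name) > len(prefix)}

props = {
    "mapreduce.map.env.MODE": "bar",
    "mapreduce.map.env.MOUNTS": "/tmp/foo,/tmp/bar",  # comma survives intact
    "mapreduce.map.memory.mb": "2048",                # unrelated, ignored
}
env = env_from_props(props, "mapreduce.map.env")
assert env == {"MODE": "bar", "MOUNTS": "/tmp/foo,/tmp/bar"}
```

Because no comma-splitting pass ever runs on the value, {{MOUNTS}} keeps both paths.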
[jira] [Commented] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
[ https://issues.apache.org/jira/browse/YARN-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436076#comment-16436076 ] Jim Brennan commented on YARN-8027: --- [~jlowe], this one is ready for review. > Setting hostname of docker container breaks for --net=host in docker 1.13 > - > > Key: YARN-8027 > URL: https://issues.apache.org/jira/browse/YARN-8027 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Fix For: 3.1.0 > > Attachments: YARN-8027-branch-3.0.001.patch, > YARN-8027-branch-3.0.002.patch, YARN-8027.001.patch > > > In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname > argument to the docker run command to set the hostname in the container to > something like: ctr-e84-1520889172376-0001-01-01. > This does not work when combined with the --net=host command line option in > Docker 1.13.1. It causes multiple failures when the client tries to resolve > the hostname and it fails. > We haven't seen this before because we were using docker 1.12.6 which seems > to ignore --hostname when you are using --net=host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436207#comment-16436207 ] Jim Brennan commented on YARN-8071: --- MAPREDUCE-7069 resolved this problem for the following properties: {quote}mapreduce.map.env.VARNAME=value mapreduce.reduce.env.VARNAME=value yarn.app.mapreduce.am.env.VARNAME=value yarn.app.mapreduce.am.admin.user.env.VARNAME=value {quote} The remaining YARN environment variable property is: {{yarn.nodemanager.admin-env}} I am planning to use this Jira to add support for the {{yarn.nodemanager.admin-env.VARNAME=value}} syntax to allow variables with commas to be specified for this property. [~jlowe], [~shaneku...@gmail.com], please let me know if you agree this is needed, and also if I'm missing any other properties. > Provide Spark-like API for setting Environment Variables to enable vars with > commas > --- > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. 
> {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
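The plan above, supporting {{yarn.nodemanager.admin-env.VARNAME=value}} alongside the existing comma-separated form, can be sketched as follows. This is a Python model of one plausible merge semantics (an assumption of mine, not taken from any patch): individual properties may carry commas and override entries from the legacy list.

```python
# Hypothetical sketch of merging the legacy comma-separated admin-env list
# with new per-variable properties. Not the actual YARN (Java) code.

def admin_env(props):
    env = {}
    # Legacy comma-separated KEY=VALUE list (values cannot contain commas).
    for pair in props.get("yarn.nodemanager.admin-env", "").split(","):
        if "=" in pair:
            key, _, value = pair.partition("=")
            env[key.strip()] = value
    # Individual properties win and may contain commas in their values.
    prefix = "yarn.nodemanager.admin-env."
    for name, value in props.items():
        if name.startswith(prefix) and len(name) > len(prefix):
            env[name[len(prefix):]] = value
    return env

props = {
    "yarn.nodemanager.admin-env": "MALLOC_ARENA_MAX=4",
    "yarn.nodemanager.admin-env.MOUNTS": "/tmp/foo,/tmp/bar",
}
assert admin_env(props) == {"MALLOC_ARENA_MAX": "4",
                            "MOUNTS": "/tmp/foo,/tmp/bar"}
```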
[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8071: -- Attachment: YARN-8071.001.patch > Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8071.001.patch > > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8071: -- Summary: Add ability to specify nodemanager environment variables individually (was: Provide Spark-like API for setting Environment Variables to enable vars with commas) > Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436387#comment-16436387 ] Jim Brennan commented on YARN-8071: --- Changed the description to be more accurate about what this Jira will address. > Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437537#comment-16437537 ] Jim Brennan commented on YARN-8071: --- [~jlowe], I believe this patch is ready for review. > Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8071.001.patch, YARN-8071.002.patch > > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} > The mapreduce properties were dealt with in [MAPREDUCE-7069]. This Jira will > address the YARN properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8071: -- Description: YARN-6830 describes a problem where environment variables that contain commas cannot be specified via {{-Dmapreduce.map.env}}. For example: {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} will set {{MOUNTS}} to {{/tmp/foo}} In that Jira, [~aw] suggested that we change the API to provide a way to specify environment variables individually, the same way that Spark does. {quote}Rather than fight with a regex why not redefine the API instead? -Dmapreduce.map.env.MODE=bar -Dmapreduce.map.env.IMAGE_NAME=foo -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar ... e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar This greatly simplifies the input validation needed and makes it clear what is actually being defined. {quote} The mapreduce properties were dealt with in [MAPREDUCE-7069]. This Jira will address the YARN properties. was: YARN-6830 describes a problem where environment variables that contain commas cannot be specified via {{-Dmapreduce.map.env}}. For example: {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} will set {{MOUNTS}} to {{/tmp/foo}} In that Jira, [~aw] suggested that we change the API to provide a way to specify environment variables individually, the same way that Spark does. {quote}Rather than fight with a regex why not redefine the API instead? -Dmapreduce.map.env.MODE=bar -Dmapreduce.map.env.IMAGE_NAME=foo -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar ... e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar This greatly simplifies the input validation needed and makes it clear what is actually being defined. 
{quote} > Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8071.001.patch > > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} > The mapreduce properties were dealt with in [MAPREDUCE-7069]. This Jira will > address the YARN properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-8071: -- Attachment: YARN-8071.002.patch > Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8071.001.patch, YARN-8071.002.patch > > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} > The mapreduce properties were dealt with in [MAPREDUCE-7069]. This Jira will > address the YARN properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7667) Docker Stop grace period should be configurable
[ https://issues.apache.org/jira/browse/YARN-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428497#comment-16428497 ] Jim Brennan commented on YARN-7667: --- Patch looks good to me. > Docker Stop grace period should be configurable > --- > > Key: YARN-7667 > URL: https://issues.apache.org/jira/browse/YARN-7667 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-7667.001.patch, YARN-7667.002.patch, > YARN-7667.003.patch > > > {{DockerStopCommand}} has a {{setGracePeriod}} method, but it is never > called. So, the stop uses the 10 second default grace period from docker -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446405#comment-16446405 ] Jim Brennan commented on YARN-7654: --- I'm not going to repeat all of the arguments, but I agree with [~jlowe] and [~ebadger]. The main point I would like to add is that [~eyang]'s proposal seems to rest on the assumption that we will do YARN-8097, exposing the {{--env-file}} option to the end-user. I don't agree that this is necessary or desirable. IIRC, YARN-8097 was filed in response to one of [~jlowe]'s earlier reviews of this Jira, where he recommended using {{--env-file}} *instead of* a list of {{-e key=value}} pairs. [~jlowe]'s original comment on this (which I still find very compelling): {quote}Actually now that I think about this more, I think we can side step the pipe character hack, the comma problems, the refactoring of the command file, etc., if we leverage the --env-file feature of docker-run. Rather than try to pass untrusted user data on the docker-run command-line and the real potential of accidentally letting some of these "variables" appear as different command-line directives to docker, we can dump the variables to a new file next to the existing command file that contains the environment variable settings, one variable per line. Then we just pass --env-file with the path to the file. That way Docker will never misinterpret this data as anything but environment variables, we don't have to mess with pipe encoding to try to get these variables marshalled through the command file before they get to the container-executor, and we don't have to worry about how to properly marshal them on the command-line for the docker command. As a bonus, I think that precludes needing to refactor the container-executor to do the argument array stuff since we're not trying to pass user-specified env variables on the command-line. 
That lets us make this JIRA a lot smaller and more focused, and we can move the execv changes to a separate JIRA that wouldn't block this one. {quote} I do not see any value in providing two ways to specify environment variables to docker, and the {{--env-file}} approach is much cleaner and easier to maintain in code. Perhaps we should consider YARN-8097 on its own. > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
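The {{--env-file}} approach quoted above can be sketched in a few lines. This is an illustrative Python model (the file name and helper names are mine, not from the patch): the variables are written one per line to a file, and the file is referenced once on the docker command line.

```python
# Sketch of the --env-file idea: instead of marshalling untrusted KEY=VALUE
# pairs onto the docker command line with repeated -e flags, dump them to a
# file (one variable per line) and pass the file once. Illustrative only.
import os
import tempfile

def write_env_file(env, path):
    with open(path, "w") as f:
        for key, value in env.items():
            f.write(f"{key}={value}\n")  # docker parses one VAR=value per line

def docker_run_argv(image, env_file):
    # The file contents can never be misread as extra command-line directives.
    return ["docker", "run", "--env-file", env_file, image]

env = {"MODE": "bar", "MOUNTS": "/tmp/foo,/tmp/bar"}  # commas are safe in a file
path = os.path.join(tempfile.mkdtemp(), "docker.env")
write_env_file(env, path)
argv = docker_run_argv("hadoop-docker:latest", path)
assert argv[2:4] == ["--env-file", path]
assert "MOUNTS=/tmp/foo,/tmp/bar\n" in open(path).read()
```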
[jira] [Commented] (YARN-8097) Add support for Docker env-file switch
[ https://issues.apache.org/jira/browse/YARN-8097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446414#comment-16446414 ] Jim Brennan commented on YARN-8097: --- [~eyang], [~jlowe], [~ebadger], [~shaneku...@gmail.com], my understanding is that this Jira was filed in response to a comment from [~jlowe] in YARN-7654 where he recommended using {{--env-file}} instead of {{-e key=value}} pairs. I don't think it was [~jlowe]'s intent to expose this capability to the end-user as another way of providing environment variables. > Add support for Docker env-file switch > -- > > Key: YARN-8097 > URL: https://issues.apache.org/jira/browse/YARN-8097 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.2.0 >Reporter: Eric Yang >Priority: Major > Attachments: YARN-8097.001.patch > > > There are two different ways to pass user environment variables to docker. > There is -e flag and --env-file which reference to a file that contains > environment variables key/value pair. It would be nice to have a way to > express env-file from HDFS, and localize the .env file in container localized > directory and pass --env-file flag to docker run command. This approach > would prevent ENV based password to show up in log file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8071) Add ability to specify nodemanager environment variables individually
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439791#comment-16439791 ] Jim Brennan commented on YARN-8071: --- [~jlowe] thanks for the review. {quote}The original code passed the current environment map, allowing admin variables to reference other variables defined so far in the environment. The new code passes an empty map which would seem to preclude this and could be a backwards compatibility issue. {quote} Thanks for pointing this out. I meant to ask specifically about this change. This was intentional. I agree it is a change in functionality, but it seemed to me that the current behavior may actually be a bug, not the intended behavior. I based this on the documentation for {{yarn.nodemanager.admin-env}}, the comment that precedes this code ({{variables here will be forced in, even if the container has specified them.}}), and the fact that everything else in this function overrides any user-specified variable (with the exception of the windows-specific classpath stuff). That said, I don't have a good idea of how likely this change would be to break something, so I am definitely willing to change it if it is considered too dangerous. {quote}The changes to TestContainerLaunch#testPrependDistcache appear to be unnecessary? {quote} They were intentional. When I was testing my new test case, I realized that passing the empty set for the {{nmVars}} argument leads to exceptions in {{addToEnvMap()}}, so I fixed the testPrependDistcache() cases as well - I assume this windows-only test must be failing without this fix. 
> Add ability to specify nodemanager environment variables individually > - > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8071.001.patch, YARN-8071.002.patch > > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} > The mapreduce properties were dealt with in [MAPREDUCE-7069]. This Jira will > address the YARN properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
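The backwards-compatibility question discussed above boils down to whether admin-env values are expanded against the environment built so far (the old behavior) or taken literally (what the new code does). A Python model of the two behaviors, for illustration only; the real code is Java in ContainerLaunch, and the expansion rule here is a simplification:

```python
# Model of the two admin-env behaviors: expand $VAR references against the
# environment built so far, or take the value literally. Illustrative only.
import re

def apply_admin_env(env, admin_env, expand=False):
    for key, value in admin_env.items():
        if expand:
            # Old behavior: $VAR resolves against the env built so far.
            value = re.sub(r"\$(\w+)", lambda m: env.get(m.group(1), ""), value)
        env[key] = value  # forced in, even if the container set it
    return env

base = {"JAVA_HOME": "/usr/java", "PATH": "/bin"}
old = apply_admin_env(dict(base), {"PATH": "$JAVA_HOME/bin:$PATH"}, expand=True)
new = apply_admin_env(dict(base), {"PATH": "$JAVA_HOME/bin:$PATH"}, expand=False)
assert old["PATH"] == "/usr/java/bin:/bin"
assert new["PATH"] == "$JAVA_HOME/bin:$PATH"  # literal, no expansion
```

Either way the admin variable overrides the container-specified one; only the reference expansion differs.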
[jira] [Commented] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415782#comment-16415782 ] Jim Brennan commented on YARN-8071: --- [~jlowe], yes I think this will affect mapreduce, yarn, and common code. I haven't done the analysis yet to figure out everything this will affect. Should this be refiled in hadoop common, or or should we add additional components to this Jira? > Provide Spark-like API for setting Environment Variables to enable vars with > commas > --- > > Key: YARN-8071 > URL: https://issues.apache.org/jira/browse/YARN-8071 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > YARN-6830 describes a problem where environment variables that contain commas > cannot be specified via {{-Dmapreduce.map.env}}. > For example: > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > will set {{MOUNTS}} to {{/tmp/foo}} > In that Jira, [~aw] suggested that we change the API to provide a way to > specify environment variables individually, the same way that Spark does. > {quote}Rather than fight with a regex why not redefine the API instead? > > -Dmapreduce.map.env.MODE=bar > -Dmapreduce.map.env.IMAGE_NAME=foo > -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar > ... > e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar > This greatly simplifies the input validation needed and makes it clear what > is actually being defined. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
[ https://issues.apache.org/jira/browse/YARN-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419392#comment-16419392 ] Jim Brennan commented on YARN-8029: --- Based on discussions in [YARN-6830], the preference is to provide a solution that allows the use of commas for these variables, so we are not going to pursue this change. > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators > > > Key: YARN-8029 > URL: https://issues.apache.org/jira/browse/YARN-8029 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Attachments: YARN-8029.001.patch, YARN-8029.002.patch > > > The following docker-related environment variables specify a comma-separated > list of mounts: > YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS > This is a problem because hadoop -Dmapreduce.map.env and related options use > comma as a delimiter. So if I put more than one mount in > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS, the comma in the variable will be > treated as a delimiter for the hadoop command-line option and all but the > first mount will be ignored.
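The collision described in YARN-8029 can be shown concretely. The variable name below matches the Jira, but the parsing code is invented for illustration: the commas between src:dst mount pairs are indistinguishable from the delimiters between environment variables, so all but the first pair are silently dropped.

```python
# Illustrative sketch: why commas as mount separators collide with the
# comma-delimited -Dmapreduce.map.env option (the parsing here is a
# stand-in for illustration, not Hadoop's code).

def naive_env_split(spec):
    """Comma-split an env spec, keeping only fragments like KEY=value."""
    return dict(item.split("=", 1) for item in spec.split(",") if "=" in item)

spec = ("YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="
        "/usr/lib:/usr/lib,/etc/passwd:/etc/passwd")
env = naive_env_split(spec)

# Only the first src:dst pair survives; the second fragment contains no
# '=' so it looks like a malformed variable and is dropped.
print(env["YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS"])  # /usr/lib:/usr/lib
```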
[jira] [Commented] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas
[ https://issues.apache.org/jira/browse/YARN-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417505#comment-16417505 ] Jim Brennan commented on YARN-8071: --- [~jlowe], I've filed [MAPREDUCE-7069] to address the mapreduce properties. I will use this one to address the yarn properties.
[jira] [Updated] (YARN-6830) Support quoted strings for environment variables
[ https://issues.apache.org/jira/browse/YARN-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan updated YARN-6830: -- Attachment: YARN-6830.002.patch > Support quoted strings for environment variables > > > Key: YARN-6830 > URL: https://issues.apache.org/jira/browse/YARN-6830 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Shane Kumpf >Assignee: Shane Kumpf >Priority: Major > Attachments: YARN-6830.001.patch, YARN-6830.002.patch > > > There are cases where it is necessary to allow for quoted string literals > within environment variable values when passed via the yarn command-line > interface. > For example, consider the following environment variables for an MR map task. > {{MODE=bar}} > {{IMAGE_NAME=foo}} > {{MOUNTS=/tmp/foo,/tmp/bar}} > When running the MR job, these environment variables are supplied as a comma-delimited > string. > {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} > In this case, {{MOUNTS}} will be parsed and added to the task environment as > {{MOUNTS=/tmp/foo}}. Any attempt to quote the embedded comma-separated value > results in the quote characters becoming part of the value, and parsing still > breaks down at the comma. > This issue is to allow for quoting the comma-separated value (escaped double > or single quote). This was mentioned on YARN-4595 and will impact YARN-5534 > as well.
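One way the quoting described in YARN-6830 could behave, as a rough sketch only (this is not the attached patch's implementation): split on commas that fall outside quotes, then strip one matching level of surrounding quotes from each value.

```python
# Rough sketch of quote-aware env splitting (not the YARN-6830 patch itself).

def split_outside_quotes(spec, sep=","):
    """Split spec on sep, ignoring separators inside single/double quotes."""
    parts, buf, quote = [], [], None
    for ch in spec:
        if quote:                      # inside a quoted region
            buf.append(ch)
            if ch == quote:
                quote = None
        elif ch in "\"'":              # opening quote
            quote = ch
            buf.append(ch)
        elif ch == sep:                # a real delimiter
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts

def parse_quoted_env(spec):
    """Parse KEY=value pairs, honoring quoted values with embedded commas."""
    env = {}
    for item in split_outside_quotes(spec):
        key, value = item.split("=", 1)
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]        # drop one level of surrounding quotes
        env[key] = value
    return env

env = parse_quoted_env('MODE=bar,IMAGE_NAME=foo,MOUNTS="/tmp/foo,/tmp/bar"')
print(env["MOUNTS"])  # /tmp/foo,/tmp/bar
```

Note the tradeoff this Jira wrestles with: the quotes must survive shell processing on the way in (hence the escaped double or single quote in the description), which is part of why YARN-8071 ultimately favored the per-variable property form instead.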
[jira] [Assigned] (YARN-6830) Support quoted strings for environment variables
[ https://issues.apache.org/jira/browse/YARN-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan reassigned YARN-6830: - Assignee: Jim Brennan (was: Shane Kumpf)