[ https://issues.apache.org/jira/browse/YARN-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342744#comment-14342744 ]
Abin Shahab commented on YARN-3080:
-----------------------------------

[~beckham007] Thanks for your comment. So you're saying that if I send a signal to the pid of the session script (as DefaultContainerExecutor does), it will reach the process that docker is running, and potentially kill it?
Please help me clarify my understanding. I am running the following steps.
First, I create a file similar to the session script. It writes the pid of the session to a pidfile:
{code}
$ cat > bash_session_pid.sh <<'EOF'
> #!/bin/bash
> echo $$ > /tmp/pidfile
> exec setsid bash -c 'docker run -itd ubuntu sleep infinity'
> EOF
{code}
I chmod and run this script, which starts a docker container:
{code}
$ chmod a+x bash_session_pid.sh
$ ./bash_session_pid.sh
$ docker ps
1b8ee377e3d2  ubuntu:14.04  "sleep infinity"  3 minutes ago  Up 3 minutes  cranky_stallman
{code}
Now I cat the pid of the session, and it says the pid is 9281:
{code}
$ cat /tmp/pidfile
9281
{code}
As you've suggested, I send a kill signal to that pid, hoping that it will kill the container:
{code}
$ kill -9 9281
{code}
I check whether the docker container has been killed:
{code}
$ docker ps
1b8ee377e3d2  ubuntu:14.04  "sleep infinity"  6 minutes ago  Up 6 minutes  cranky_stallman
{code}
Since your method did not kill the container, I get the pid of the process running under the container:
{code}
$ docker inspect 1b8ee377e3d2
9289
{code}
I check the tree of this process:
{code}
$ pstree -ps 9289
init(1)---docker(6512)---sleep(9289)
{code}
As I had expected, this process is a child of the docker daemon, and therefore, if it is killed, the container will be killed. So I send a kill signal to this pid:
{code}
$ kill -9 9289
{code}
Now I verify whether the container is alive:
{code}
$ docker ps
{code}
The container is dead.
From what I understand, the session pid has no relation to the actual pid of the container, and therefore sending it a signal is meaningless.
Therefore, if that meaningless pid is in the pidfile, the NodeManager/ResourceManager will not be able to send signals to containers as needed.
Please let me know where my understanding is mistaken, and I will gladly switch to the simpler implementation.

> The DockerContainerExecutor could not write the right pid to container pidFile
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3080
>                 URL: https://issues.apache.org/jira/browse/YARN-3080
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Beckham007
>            Assignee: Abin Shahab
>         Attachments: YARN-3080.patch, YARN-3080.patch, YARN-3080.patch, YARN-3080.patch
>
>
> The docker_container_executor_session.sh looks like this:
> {quote}
> #!/usr/bin/env bash
> echo `/usr/bin/docker inspect --format {{.State.Pid}} container_1421723685222_0008_01_000002` > /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_000002/container_1421723685222_0008_01_000002.pid.tmp
> /bin/mv -f /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_000002/container_1421723685222_0008_01_000002.pid.tmp /data/nm_restart/hadoop-2.4.1/data/yarn/local/nmPrivate/application_1421723685222_0008/container_1421723685222_0008_01_000002/container_1421723685222_0008_01_000002.pid
> /usr/bin/docker run --rm --name container_1421723685222_0008_01_000002 -e GAIA_HOST_IP=c162 -e GAIA_API_SERVER=10.6.207.226:8080 -e GAIA_CLUSTER_ID=shpc-nm_restart -e GAIA_QUEUE=root.tdwadmin -e GAIA_APP_NAME=test_nm_docker -e GAIA_INSTANCE_ID=1 -e GAIA_CONTAINER_ID=container_1421723685222_0008_01_000002 --memory=32M --cpu-shares=1024 -v /data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_000002:/data/nm_restart/hadoop-2.4.1/data/yarn/container-logs/application_1421723685222_0008/container_1421723685222_0008_01_000002 -v /data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_000002:/data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_000002 -P -e A=B --privileged=true docker.oa.com:8080/library/centos7 bash "/data/nm_restart/hadoop-2.4.1/data/yarn/local/usercache/tdwadmin/appcache/application_1421723685222_0008/container_1421723685222_0008_01_000002/launch_container.sh"
> {quote}
> The DockerContainerExecutor runs docker inspect before docker run, so docker inspect cannot return the right pid for the container; signalContainer() and NM restart would therefore fail.
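
The ordering problem in the session script (docker inspect runs before the container exists) could be avoided by starting the container first and only then asking Docker for its pid. The following is a minimal sketch under stated assumptions, not the approach taken in the attached patches: the function names `write_pid_file` and `launch_and_record_pid` are hypothetical, and it assumes the container is started detached with `docker run -d` so that the script can proceed to the `docker inspect` step.

```shell
#!/usr/bin/env bash
# Sketch only: function names and arguments are hypothetical, not taken from YARN.

# Write a pid to a pid file atomically (temp file, then rename),
# mirroring the .pid.tmp + mv -f pattern used by the session script.
write_pid_file() {
  local pid="$1" pid_file="$2"
  echo "$pid" > "${pid_file}.tmp" || return 1
  /bin/mv -f "${pid_file}.tmp" "$pid_file"
}

# Start the container detached FIRST, then query its host pid and record it.
# Assumption: the container may run detached (-d); this differs from the
# foreground "docker run --rm" in the original script.
launch_and_record_pid() {
  local name="$1" image="$2" pid_file="$3"
  shift 3
  docker run -d --name "$name" "$image" "$@" > /dev/null || return 1
  local pid
  pid="$(docker inspect --format '{{.State.Pid}}' "$name")" || return 1
  write_pid_file "$pid" "$pid_file"
}
```

A call might look like `launch_and_record_pid my_container ubuntu /tmp/my_container.pid sleep infinity`. Because a detached run returns immediately, a real launcher would still need to block until the container exits (for example with `docker wait`), which this sketch leaves out.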