[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082158#comment-15082158
 ] 

Jun Gong commented on YARN-4459:
--------------------------------

Thanks [~Naganarasimha] for the review, very appreciate it!

{quote}
IIUC existing code checks whether container process has created any sub process 
then kill all the process, else if its a single process then i presume 
kill(-pid,0) will return -1 then it tries to kill only the container process id 
only. Can you confirm this by testing?
{quote}
I do not have a good method to add a test case for it now. I will try to 
explain it by following test cases:
1. If container's parent process does not exist and its child process does 
exist, existing code works well.
{code}
root@bd74d89a2294:/$ cat test.sh 
sleep 6666 &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep 6666' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh 
PID for 'sleep 6666' : 26877, its pgid : 26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26877     1  0 08:48 pts/0    00:00:00 sleep 6666
root     26880 26797  0 08:48 pts/0    00:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 -26876
root@bd74d89a2294:/$ kill -15 -26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26882 26797  0 08:48 pts/0    00:00:00 grep sleep
{code}

2. If container's parent process does not exist and its child process does not 
exist either, existing code will kill process wrongly.
{code}
root@bd74d89a2294:/$ cat test.sh 
sleep 2 &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep 2' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh 
PID for 'sleep 2' : 26890, its pgid : 26889
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26893 26797  0 08:56 pts/0    00:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 26889
-bash: kill: (26889) - No such process
{code}
Then we check existing code in container-executor.c for the above case:
{code}
  if (kill(-pid,0) < 0) {
    if (kill(pid, 0) < 0) {
      if (errno == ESRCH) {
        return INVALID_CONTAINER_PID;
      }
      fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
              pid, sig, strerror(errno));
      return -1;
    } else {
      has_group = 0;
    }
  }

  if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
{code}
*kill(-pid,0)* will return -1. If *pid* is reused for a new process(suppose 
that the container has survived for a long time, pid recycle occurs, then this 
pid might be reused again), *kill(pid, 0)* will return 0, then *has_group* will 
be set to 0, and the code *kill((has_group ? -1 : 1) * pid, sig)* will try to 
kill *pid* with *sig*. This is the problem.

> container-executor might kill process wrongly
> ---------------------------------------------
>
>                 Key: YARN-4459
>                 URL: https://issues.apache.org/jira/browse/YARN-4459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to