[jira] [Comment Edited] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-26 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558
 ] 

Andrei Budnik edited comment on MESOS-9887 at 8/26/19 12:22 PM:


[https://reviews.apache.org/r/71361/
https://reviews.apache.org/r/71343/|https://reviews.apache.org/r/71343/]


was (Author: abudnik):
https://reviews.apache.org/r/71343/

> Race condition between two terminal task status updates for Docker executor.
> 
>
> Key: MESOS-9887
> URL: https://issues.apache.org/jira/browse/MESOS-9887
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: agent, containerization
> Attachments: race_example.txt
>
>
> h2. Overview
> Expected behavior:
>  Task successfully finishes and sends TASK_FINISHED status update.
> Observed behavior:
>  Task successfully finishes, but the agent sends TASK_FAILED with the reason 
> "REASON_EXECUTOR_TERMINATED".
> In normal circumstances, Docker executor 
> [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758]
>  final status update TASK_FINISHED to the agent, which then [gets 
> processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543]
>  by the agent before termination of the executor's process.
> However, if the processing of the initial TASK_FINISHED gets delayed, then 
> there is a chance that Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED which will [be 
> handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826]
>  prior to the TASK_FINISHED status update.
> See attached logs which contain an example of the race condition.
> h2. Reproducing bug
> 1. Add the following code:
> {code:java}
>   static int c = 0;
>   if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates.
> ::sleep(2);
>   }
> {code}
> to the 
> [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578]
>  and to the 
> [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].
> 2. Recompile mesos
> 3. Launch mesos master and agent locally
> 4. Launch a simple Docker task via `mesos-execute`:
> {code}
> #  cd build
> ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" 
> --command="ls"
> {code}
> h2. Race condition - description
> 1. Mesos agent receives TASK_FINISHED status update and then subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].
> 2. `containerizer->status()` operation for TASK_FINISHED status update gets 
> delayed in the composing containerizer (e.g. due to switch of the worker 
> thread that executes `status` method).
> 3. Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED.
> 4. Docker containerizer destroys the container. A registered callback for the 
> `containerizer->wait` call in the composing containerizer dispatches [lambda 
> function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373]
>  that will clean up `containers_` map.
> 5. Composing c'zer resumes and dispatches 
> `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]`
>  method to the Docker containerizer for TASK_FINISHED, which in turn hangs 
> for a few seconds.
> 6. Corresponding `containerId` gets removed from the `containers_` map of the 
> composing c'zer.
> 7. Mesos agent subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]
>  for the TASK_FAILED status update.
> 8. Composing c'zer returns ["Container not 
> found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576]
>  for TASK_FAILED.
> 9. 
> `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]`
>  stores TASK_FAILED terminal status update in 

[jira] [Comment Edited] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-26 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558
 ] 

Andrei Budnik edited comment on MESOS-9887 at 8/26/19 12:22 PM:


[https://reviews.apache.org/r/71361/]
[https://reviews.apache.org/r/71343/]


was (Author: abudnik):
[https://reviews.apache.org/r/71361/
https://reviews.apache.org/r/71343/|https://reviews.apache.org/r/71343/]

> Race condition between two terminal task status updates for Docker executor.
> 
>
> Key: MESOS-9887
> URL: https://issues.apache.org/jira/browse/MESOS-9887
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: agent, containerization
> Attachments: race_example.txt
>
>
> h2. Overview
> Expected behavior:
>  Task successfully finishes and sends TASK_FINISHED status update.
> Observed behavior:
>  Task successfully finishes, but the agent sends TASK_FAILED with the reason 
> "REASON_EXECUTOR_TERMINATED".
> In normal circumstances, Docker executor 
> [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758]
>  final status update TASK_FINISHED to the agent, which then [gets 
> processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543]
>  by the agent before termination of the executor's process.
> However, if the processing of the initial TASK_FINISHED gets delayed, then 
> there is a chance that Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED which will [be 
> handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826]
>  prior to the TASK_FINISHED status update.
> See attached logs which contain an example of the race condition.
> h2. Reproducing bug
> 1. Add the following code:
> {code:java}
>   static int c = 0;
>   if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates.
> ::sleep(2);
>   }
> {code}
> to the 
> [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578]
>  and to the 
> [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].
> 2. Recompile mesos
> 3. Launch mesos master and agent locally
> 4. Launch a simple Docker task via `mesos-execute`:
> {code}
> #  cd build
> ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" 
> --command="ls"
> {code}
> h2. Race condition - description
> 1. Mesos agent receives TASK_FINISHED status update and then subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].
> 2. `containerizer->status()` operation for TASK_FINISHED status update gets 
> delayed in the composing containerizer (e.g. due to switch of the worker 
> thread that executes `status` method).
> 3. Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED.
> 4. Docker containerizer destroys the container. A registered callback for the 
> `containerizer->wait` call in the composing containerizer dispatches [lambda 
> function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373]
>  that will clean up `containers_` map.
> 5. Composing c'zer resumes and dispatches 
> `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]`
>  method to the Docker containerizer for TASK_FINISHED, which in turn hangs 
> for a few seconds.
> 6. Corresponding `containerId` gets removed from the `containers_` map of the 
> composing c'zer.
> 7. Mesos agent subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]
>  for the TASK_FAILED status update.
> 8. Composing c'zer returns ["Container not 
> found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576]
>  for TASK_FAILED.
> 9. 
> `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]`
>