Re: Review Request 61530: Enabled retries for `killTasks` in docker executor.

2017-08-10 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/#review182624
---


Ship it!




Great job, great testing. Thanks a lot!

- Alexander Rukletsov


On Aug. 10, 2017, 4:14 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61530/
> ---
> 
> (Updated Aug. 10, 2017, 4:14 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-6743
> https://issues.apache.org/jira/browse/MESOS-6743
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, after `docker stop` command failure, docker executor
> neither allowed a scheduler to retry `killTask` command, nor retried to
> kill a task when task killing was triggered by a failed health check.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007 
> 
> 
> Diff: https://reviews.apache.org/r/61530/diff/2/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> Manual testing:
> 
> Emulating `docker stop` errors:
> ===
> 1. Add `return fmt.Errorf("Emulating error!")` to 
> https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
> 2. build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 3. stop docker service and launch dockerd binary, like: `sudo 
> ./bundles/17.06.0-dev/binary-daemon/dockerd`
> 
> Emulating docker daemon hang:
> =
> 1. `ps aux|grep dockerd` - 2 processes will be found
> 2. `sudo kill -STOP  ` - send SIGSTOP to docker daemon processes 
> just before sending `docker stop`
> 
> Emulating health check failure in docker executor:
> ==
> 1. Add
> ```c++
>   static int fake = 0;
>   if (++fake > 10) {
> failure();
> return;
>   }
> ```
> to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
> 2. Add
> ```c++
>HealthCheck healthCheck;
>healthCheck.set_type(HealthCheck::COMMAND);
>healthCheck.mutable_command()->set_value("exit 0");
>healthCheck.set_delay_seconds(0);
>healthCheck.set_interval_seconds(0);
>healthCheck.set_grace_period_seconds(1);
>_task.mutable_health_check()->CopyFrom(healthCheck);
> ```
> to `CommandScheduler::offers()` in `src/cli/execute.cpp`
> 3. compile mesos
> 4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/some_user/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 5. launch docker executor: `./src/mesos-execute --master="127.0.1.1:5050" 
> --name="a" --containerizer=docker --docker_image="ubuntu:xenial" 
> --command="while true; do : ; done"`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 61530: Enabled retries for `killTasks` in docker executor.

2017-08-10 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/
---

(Updated Aug. 10, 2017, 4:14 p.m.)


Review request for mesos and Alexander Rukletsov.


Bugs: MESOS-6743
https://issues.apache.org/jira/browse/MESOS-6743


Repository: mesos


Description
---

Previously, after `docker stop` command failure, docker executor
neither allowed a scheduler to retry `killTask` command, nor retried to
kill a task when task killing was triggered by a failed health check.


Diffs (updated)
-

  src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007 


Diff: https://reviews.apache.org/r/61530/diff/2/

Changes: https://reviews.apache.org/r/61530/diff/1-2/


Testing
---

sudo make check

Manual testing:

Emulating `docker stop` errors:
===
1. Add `return fmt.Errorf("Emulating error!")` to 
https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
2. build docker from sources: 
http://oyvindsk.com/writing/docker-build-from-source
3. stop docker service and launch dockerd binary, like: `sudo 
./bundles/17.06.0-dev/binary-daemon/dockerd`

Emulating docker daemon hang:
=
1. `ps aux|grep dockerd` - 2 processes will be found
2. `sudo kill -STOP  ` - send SIGSTOP to docker daemon processes 
just before sending `docker stop`

Emulating health check failure in docker executor:
==
1. Add
```c++
  static int fake = 0;
  if (++fake > 10) {
failure();
return;
  }
```
to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
2. Add
```c++
   HealthCheck healthCheck;
   healthCheck.set_type(HealthCheck::COMMAND);
   healthCheck.mutable_command()->set_value("exit 0");
   healthCheck.set_delay_seconds(0);
   healthCheck.set_interval_seconds(0);
   healthCheck.set_grace_period_seconds(1);
   _task.mutable_health_check()->CopyFrom(healthCheck);
```
to `CommandScheduler::offers()` in `src/cli/execute.cpp`
3. compile mesos
4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh 
--resources="cpus:1;mem:100" 
--work_dir='/home/some_user/mesos/build/var/agent-1' 
--containerizers="docker,mesos" --master="127.0.1.1:5050"`
5. launch docker executor: `./src/mesos-execute --master="127.0.1.1:5050" 
--name="a" --containerizer=docker --docker_image="ubuntu:xenial" 
--command="while true; do : ; done"`


Thanks,

Andrei Budnik



Re: Review Request 61530: Enabled retries for `killTasks` in docker executor.

2017-08-10 Thread Alexander Rukletsov


> On Aug. 10, 2017, 12:07 p.m., Alexander Rukletsov wrote:
> > src/docker/executor.cpp
> > Lines 410-415 (original), 416-421 (patched)
> > 
> >
> > Let's add a comment explaining why are we doing retry / unblock on 
> > failure and why not timeout. Something like:
> > ```
> > // Invoking `docker stop` might be unsuccessful, in which case the 
> > container most probably does not receive the signal. In this case we should 
> > allow schedulers to retry the kill operation or, if the kill was initiated 
> > by a failing health check, retry ourselves. We do not bail out nor stop 
> > retrying to avoid sending a terminal status update while the container 
> > might still be running.
> > //
> > // NOTE: `docker stop` might also hang. We do not address this for now, 
> > because there is no evidence that in this case docker daemon might funciton 
> > properly, i.e., it is only the docker cli command that hangs, and hence 
> > there is not so much we can do.
> > ```

We should also refer to https://issues.apache.org/jira/browse/MESOS-6743 in the 
comment so that folks can get more context.


- Alexander


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/#review182573
---


On Aug. 9, 2017, 4:55 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61530/
> ---
> 
> (Updated Aug. 9, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-6743
> https://issues.apache.org/jira/browse/MESOS-6743
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, after `docker stop` command failure, docker executor
> neither allowed a scheduler to retry `killTask` command, nor retried to
> kill a task when task killing was triggered by a failed health check.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007 
> 
> 
> Diff: https://reviews.apache.org/r/61530/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> Manual testing:
> 
> Emulating `docker stop` errors:
> ===
> 1. Add `return fmt.Errorf("Emulating error!")` to 
> https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
> 2. build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 3. stop docker service and launch dockerd binary, like: `sudo 
> ./bundles/17.06.0-dev/binary-daemon/dockerd`
> 
> Emulating docker daemon hang:
> =
> 1. `ps aux|grep dockerd` - 2 processes will be found
> 2. `sudo kill -STOP  ` - send SIGSTOP to docker daemon processes 
> just before sending `docker stop`
> 
> Emulating health check failure in docker executor:
> ==
> 1. Add
> ```c++
>   static int fake = 0;
>   if (++fake > 10) {
> failure();
> return;
>   }
> ```
> to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
> 2. Add
> ```c++
>HealthCheck healthCheck;
>healthCheck.set_type(HealthCheck::COMMAND);
>healthCheck.mutable_command()->set_value("exit 0");
>healthCheck.set_delay_seconds(0);
>healthCheck.set_interval_seconds(0);
>healthCheck.set_grace_period_seconds(1);
>_task.mutable_health_check()->CopyFrom(healthCheck);
> ```
> to `CommandScheduler::offers()` in `src/cli/execute.cpp`
> 3. compile mesos
> 4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/some_user/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 5. launch docker executor: `./src/mesos-execute --master="127.0.1.1:5050" 
> --name="a" --containerizer=docker --docker_image="ubuntu:xenial" 
> --command="while true; do : ; done"`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 61530: Enabled retries for `killTasks` in docker executor.

2017-08-10 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61530/#review182573
---




src/docker/executor.cpp
Lines 31 (patched)


We sort includes alphabetically.



src/docker/executor.cpp
Lines 410-415 (original), 416-421 (patched)


Let's add a comment explaining why are we doing retry / unblock on failure 
and why not timeout. Something like:
```
// Invoking `docker stop` might be unsuccessful, in which case the 
container most probably does not receive the signal. In this case we should 
allow schedulers to retry the kill operation or, if the kill was initiated by a 
failing health check, retry ourselves. We do not bail out nor stop retrying to 
avoid sending a terminal status update while the container might still be 
running.
//
// NOTE: `docker stop` might also hang. We do not address this for now, 
because there is no evidence that in this case docker daemon might funciton 
properly, i.e., it is only the docker cli command that hangs, and hence there 
is not so much we can do.
```



src/docker/executor.cpp
Lines 427-428 (patched)


Please no "magic constants" in the code. Add it to the existing section 
above. `5s` sounds reasonable to me.



src/docker/executor.cpp
Lines 604-606 (original), 616-618 (patched)


We should augment the comment here explaining why we need 
`killAttemptInProgress` or whatever we end up having.



src/docker/executor.cpp
Lines 606-608 (original), 618-621 (patched)


Instead of `killed` flag, which can be rolled back, and `firstKillAttempt`, 
that kinda indicates that a kill has been issued before, how about having 
`killed` and `killAttemptInProgress` with the following semantics:
- After task transitions to `killed`, there is no way back. This 
corresponds to our intent. This also fixes the race with the health update that 
might come in after we set `killed = false` to unblock schedulers.
- `killAttemptInProgress` will guard kill requests from schedulers and will 
be adjusted accordingly.


- Alexander Rukletsov


On Aug. 9, 2017, 4:55 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61530/
> ---
> 
> (Updated Aug. 9, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-6743
> https://issues.apache.org/jira/browse/MESOS-6743
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, after `docker stop` command failure, docker executor
> neither allowed a scheduler to retry `killTask` command, nor retried to
> kill a task when task killing was triggered by a failed health check.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 26f12ec002f754fab0d34c01472cf95b499d8007 
> 
> 
> Diff: https://reviews.apache.org/r/61530/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> Manual testing:
> 
> Emulating `docker stop` errors:
> ===
> 1. Add `return fmt.Errorf("Emulating error!")` to 
> https://github.com/docker/docker-ce/blob/master/components/engine/daemon/stop.go#L21
> 2. build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 3. stop docker service and launch dockerd binary, like: `sudo 
> ./bundles/17.06.0-dev/binary-daemon/dockerd`
> 
> Emulating docker daemon hang:
> =
> 1. `ps aux|grep dockerd` - 2 processes will be found
> 2. `sudo kill -STOP  ` - send SIGSTOP to docker daemon processes 
> just before sending `docker stop`
> 
> Emulating health check failure in docker executor:
> ==
> 1. Add
> ```c++
>   static int fake = 0;
>   if (++fake > 10) {
> failure();
> return;
>   }
> ```
> to `HealthChecker::processCheckResult()` in `src/checks/health_checker.cpp`
> 2. Add
> ```c++
>HealthCheck healthCheck;
>healthCheck.set_type(HealthCheck::COMMAND);
>healthCheck.mutable_command()->set_value("exit 0");
>healthCheck.set_delay_seconds(0);
>healthCheck.set_interval_seconds(0);
>healthCheck.set_grace_period_seconds(1);
>_task.mutable_health_check()->CopyFrom(healthCheck);
> ```
> to `CommandScheduler::offers()` in `src/cli/execute.cpp`
> 3. compile mesos
> 4. run mesos agent: `sudo GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/some_user/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 5. launch