Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-03-02 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198538
---


Ship it!




Ship It!

- Gilbert Song


On March 2, 2018, 7:15 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated March 2, 2018, 7:15 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/6/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-03-02 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/
---

(Updated March 2, 2018, 3:15 p.m.)


Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
Vinod Kone.


Bugs: MESOS-8574
https://issues.apache.org/jira/browse/MESOS-8574


Repository: mesos


Description
---

Previosly, if `docker inspect` command hanged, the docker container
ended up in an unkillable state. This patch adds a timeout for inspect
command after receiving `killTask` analogically to `reaped` handler.
In addition we've added a timeout for `docker stop` command. If docker
`stop` or `inspect` command times out, we discard the related future,
thus the docker library kills previously spawned docker cli subprocess.
As a result, a scheduler can retry `killTask` operation to handle
nasty docker bugs that lead to hanging docker cli.


Diffs (updated)
-

  src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 


Diff: https://reviews.apache.org/r/65713/diff/6/

Changes: https://reviews.apache.org/r/65713/diff/5-6/


Testing
---

internal CI

Manual testing:
1. Build docker from sources: 
http://oyvindsk.com/writing/docker-build-from-source
2. Modify `ContainerInspect` function from `docker/inspect.go`:
```
 func (daemon *Daemon) ContainerInspect(name string, size bool, version string) 
(interface{}, error) {
+   time.Sleep(10 * time.Second)
```
3. Modify `ContainerStop` function from `docker/stop.go`:
```
 func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
+   rand.Seed(time.Now().UTC().UnixNano())
+   if rand.Intn(2) == 0 {
+   time.Sleep(20 * time.Second)
+   }
```
4. Rebuild docker: `sudo make build && sudo make binary`
5. Stop system docker daemon: `sudo service docker stop`
6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
7. Modify `src/cli/execute.cpp`:
  a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
offer.agent_id());` after 
https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
  b) Add a new method `retryKill` to `CommandScheduler`:
```
  void retryKill(const TaskID& taskId, const AgentID& agentId)
  {
killTask(taskId, agentId);
delay(Seconds(6), self(), ::retryKill, taskId, agentId);
  }
```
8. Rebuild mesos
9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
--resources="cpus:1;mem:100" 
--work_dir='/home/abudnik/mesos/build/var/agent-1' 
--containerizers="docker,mesos" --master="127.0.1.1:5050"`
11. Submit a task for the docker executor: `./src/mesos-execute 
--master="127.0.1.1:5050" --name="a" --containerizer=docker 
--docker_image="ubuntu:xenial" --command="sleep "`


Thanks,

Andrei Budnik



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-03-01 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198473
---


Fix it, then Ship it!





src/docker/executor.cpp
Lines 515-516 (original), 535-536 (patched)


Add comments:

docker stop may or may not finish. Our behavior is to give the subprocess a 
chance to finish until next time _killtask() isinvoked. Otherwise, the 
subprocess will not be discarded forever.



src/docker/executor.cpp
Lines 516-518 (original), 536-546 (patched)


Related to my previous comment:

`KILL_RETRY_INTERVAL` is for health check. probably we should only do 
`stop.after()` if `killedByHealthCheck` is true.



src/docker/executor.cpp
Line 518 (original), 538-539 (patched)


this log is not true for normal task kills that are invoked. E.g., if a 
task is not killed by health check, after the time out we log "Docker stop 
timed out ...", but actually the docker stop command is timed out/discarded 
when the next killtask is invoked. And it is even not true if a scheduler never 
retry killtask(). So this log is not true for a normal killtask(). we should 
log it for `killedBy HealthCheck` case only.


- Gilbert Song


On Feb. 27, 2018, 5:37 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 27, 2018, 5:37 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/5/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-03-01 Thread Greg Mann


> On March 1, 2018, 8:14 a.m., Gilbert Song wrote:
> > src/docker/executor.cpp
> > Lines 540-544 (patched)
> > 
> >
> > Should we just call stop.discard() here and return stop?
> > 
> > If it is not killed by health check, we timed out the docker stop and 
> > we should discard it. If it is killed by health check, .discard() would 
> > trigger we do os::killtree on that docker stop subprocess and then return a 
> > failure, which invokes the onFailed callback below and retry.
> 
> Andrei Budnik wrote:
> >Should we just call stop.discard() here and return stop?
> 
> If we trigger `onFailed` callback, then next `_killTask` will be 
> scheduled after 5 more seconds via `delay`, so in case of hanging docker, we 
> will retry every `2 * KILL_RETRY_INTERVAL`, instead of `KILL_RETRY_INTERVAL`.
> 
> >.discard() would trigger we do os::killtree on that docker stop 
> subprocess and then return a failure, which invokes the onFailed callback
> 
> I think it will call `onDiscarded` rather than `onFailed`.

That's correct - I implemented the discard behavior such that calling 
`.discard()` on the `stop` future will almost always transition it to the 
discarded state (when the Subprocess exits with any status code after we do 
killtree). In a rare case where the `process::reap()` used within `Subprocess` 
to reap the subprocess PID actually fails, then the `Future` could transition 
to the failed state.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198436
---


On Feb. 28, 2018, 1:37 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 28, 2018, 1:37 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/5/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-03-01 Thread Andrei Budnik


> On March 1, 2018, 8:14 a.m., Gilbert Song wrote:
> > src/docker/executor.cpp
> > Lines 467-468 (patched)
> > 
> >
> > In this case, should we log it?

We log inspect timeouts in the following patch inside inspect loop.


> On March 1, 2018, 8:14 a.m., Gilbert Song wrote:
> > src/docker/executor.cpp
> > Lines 540-544 (patched)
> > 
> >
> > Should we just call stop.discard() here and return stop?
> > 
> > If it is not killed by health check, we timed out the docker stop and 
> > we should discard it. If it is killed by health check, .discard() would 
> > trigger we do os::killtree on that docker stop subprocess and then return a 
> > failure, which invokes the onFailed callback below and retry.

>Should we just call stop.discard() here and return stop?

If we trigger `onFailed` callback, then next `_killTask` will be scheduled 
after 5 more seconds via `delay`, so in case of hanging docker, we will retry 
every `2 * KILL_RETRY_INTERVAL`, instead of `KILL_RETRY_INTERVAL`.

>.discard() would trigger we do os::killtree on that docker stop subprocess and 
>then return a failure, which invokes the onFailed callback

I think it will call `onDiscarded` rather than `onFailed`.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198436
---


On Feb. 28, 2018, 1:37 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 28, 2018, 1:37 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/5/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-03-01 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198436
---




src/docker/executor.cpp
Lines 467-468 (patched)


In this case, should we log it?



src/docker/executor.cpp
Lines 540-544 (patched)


Should we just call stop.discard() here and return stop?

If it is not killed by health check, we timed out the docker stop and we 
should discard it. If it is killed by health check, .discard() would trigger we 
do os::killtree on that docker stop subprocess and then return a failure, which 
invokes the onFailed callback below and retry.


- Gilbert Song


On Feb. 27, 2018, 5:37 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 27, 2018, 5:37 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/5/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-28 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198381
---




src/docker/executor.cpp
Lines 516-517 (patched)


Nit:
s/docker/Docker/g
s/cli/CLI/


- Greg Mann


On Feb. 28, 2018, 1:37 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 28, 2018, 1:37 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/4/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-28 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198380
---


Fix it, then Ship it!





src/docker/executor.cpp
Lines 516 (patched)


Nit:
s/a future/this future/


- Greg Mann


On Feb. 28, 2018, 1:37 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 28, 2018, 1:37 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/4/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-27 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/
---

(Updated Feb. 28, 2018, 1:37 a.m.)


Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
Vinod Kone.


Bugs: MESOS-8574
https://issues.apache.org/jira/browse/MESOS-8574


Repository: mesos


Description
---

Previosly, if `docker inspect` command hanged, the docker container
ended up in an unkillable state. This patch adds a timeout for inspect
command after receiving `killTask` analogically to `reaped` handler.
In addition we've added a timeout for `docker stop` command. If docker
`stop` or `inspect` command times out, we discard the related future,
thus the docker library kills previously spawned docker cli subprocess.
As a result, a scheduler can retry `killTask` operation to handle
nasty docker bugs that lead to hanging docker cli.


Diffs
-

  src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 


Diff: https://reviews.apache.org/r/65713/diff/4/


Testing
---

internal CI

Manual testing:
1. Build docker from sources: 
http://oyvindsk.com/writing/docker-build-from-source
2. Modify `ContainerInspect` function from `docker/inspect.go`:
```
 func (daemon *Daemon) ContainerInspect(name string, size bool, version string) 
(interface{}, error) {
+   time.Sleep(10 * time.Second)
```
3. Modify `ContainerStop` function from `docker/stop.go`:
```
 func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
+   rand.Seed(time.Now().UTC().UnixNano())
+   if rand.Intn(2) == 0 {
+   time.Sleep(20 * time.Second)
+   }
```
4. Rebuild docker: `sudo make build && sudo make binary`
5. Stop system docker daemon: `sudo service docker stop`
6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
7. Modify `src/cli/execute.cpp`:
  a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
offer.agent_id());` after 
https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
  b) Add a new method `retryKill` to `CommandScheduler`:
```
  void retryKill(const TaskID& taskId, const AgentID& agentId)
  {
killTask(taskId, agentId);
delay(Seconds(6), self(), ::retryKill, taskId, agentId);
  }
```
8. Rebuild mesos
9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
--resources="cpus:1;mem:100" 
--work_dir='/home/abudnik/mesos/build/var/agent-1' 
--containerizers="docker,mesos" --master="127.0.1.1:5050"`
11. Submit a task for the docker executor: `./src/mesos-execute 
--master="127.0.1.1:5050" --name="a" --containerizer=docker 
--docker_image="ubuntu:xenial" --command="sleep "`


Thanks,

Andrei Budnik



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-27 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/
---

(Updated Feb. 27, 2018, 10:34 p.m.)


Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
Vinod Kone.


Changes
---

Removed `killingInProgress` guard variable, because we discard `stop` command.


Bugs: MESOS-8574
https://issues.apache.org/jira/browse/MESOS-8574


Repository: mesos


Description
---

Previosly, if `docker inspect` command hanged, the docker container
ended up in an unkillable state. This patch adds a timeout for inspect
command after receiving `killTask` analogically to `reaped` handler.
In addition we've added a timeout for `docker stop` command. If docker
`stop` or `inspect` command times out, we discard the related future,
thus the docker library kills previously spawned docker cli subprocess.
As a result, a scheduler can retry `killTask` operation to handle
nasty docker bugs that lead to hanging docker cli.


Diffs (updated)
-

  src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 


Diff: https://reviews.apache.org/r/65713/diff/4/

Changes: https://reviews.apache.org/r/65713/diff/3-4/


Testing
---

internal CI

Manual testing:
1. Build docker from sources: 
http://oyvindsk.com/writing/docker-build-from-source
2. Modify `ContainerInspect` function from `docker/inspect.go`:
```
 func (daemon *Daemon) ContainerInspect(name string, size bool, version string) 
(interface{}, error) {
+   time.Sleep(10 * time.Second)
```
3. Modify `ContainerStop` function from `docker/stop.go`:
```
 func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
+   rand.Seed(time.Now().UTC().UnixNano())
+   if rand.Intn(2) == 0 {
+   time.Sleep(20 * time.Second)
+   }
```
4. Rebuild docker: `sudo make build && sudo make binary`
5. Stop system docker daemon: `sudo service docker stop`
6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
7. Modify `src/cli/execute.cpp`:
  a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
offer.agent_id());` after 
https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
  b) Add a new method `retryKill` to `CommandScheduler`:
```
  void retryKill(const TaskID& taskId, const AgentID& agentId)
  {
killTask(taskId, agentId);
delay(Seconds(6), self(), ::retryKill, taskId, agentId);
  }
```
8. Rebuild mesos
9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
--resources="cpus:1;mem:100" 
--work_dir='/home/abudnik/mesos/build/var/agent-1' 
--containerizers="docker,mesos" --master="127.0.1.1:5050"`
11. Submit a task for the docker executor: `./src/mesos-execute 
--master="127.0.1.1:5050" --name="a" --containerizer=docker 
--docker_image="ubuntu:xenial" --command="sleep "`


Thanks,

Andrei Budnik



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-23 Thread Andrei Budnik


> On Feb. 23, 2018, 6:24 p.m., Gilbert Song wrote:
> > src/docker/executor.cpp
> > Lines 319-321 (patched)
> > 
> >
> > Should we also call `.discard()` in this case?

We need to discard `inspect` only in `killTask()` and `reaped()` methods. So we 
give `inspect` a chance to complete, before a scheduler sends a `killTask` 
message.
BTW, this code is removed in the following patch in the chain.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198196
---


On Feb. 22, 2018, 11:28 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 22, 2018, 11:28 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-23 Thread Gilbert Song

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198196
---




src/docker/executor.cpp
Lines 319-321 (patched)


Should we also call `.discard()` in this case?


- Gilbert Song


On Feb. 22, 2018, 3:28 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 22, 2018, 3:28 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-23 Thread Andrei Budnik


> On Feb. 23, 2018, 8:42 a.m., Greg Mann wrote:
> > src/docker/executor.cpp
> > Lines 535 (patched)
> > 
> >
> > We should probably move this inside the `if (killedByHealthCheck)` 
> > conditional, since the kill is still in progress after this if it was not 
> > initiated by a health check.

If we move `killingInProgress` inside the conditional, all subsequent attempts 
to kill a task by a scheduler will be ignored.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198148
---


On Feb. 22, 2018, 11:28 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 22, 2018, 11:28 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-23 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198148
---




src/docker/executor.cpp
Lines 535 (patched)


We should probably move this inside the `if (killedByHealthCheck)` 
conditional, since the kill is still in progress after this if it was not 
initiated by a health check.


- Greg Mann


On Feb. 22, 2018, 11:28 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 22, 2018, 11:28 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-22 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/
---

(Updated Feb. 22, 2018, 11:28 p.m.)


Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
Vinod Kone.


Bugs: MESOS-8574
https://issues.apache.org/jira/browse/MESOS-8574


Repository: mesos


Description
---

Previosly, if `docker inspect` command hanged, the docker container
ended up in an unkillable state. This patch adds a timeout for inspect
command after receiving `killTask` analogically to `reaped` handler.
In addition we've added a timeout for `docker stop` command. If docker
`stop` or `inspect` command times out, we discard the related future,
thus the docker library kills previously spawned docker cli subprocess.
As a result, a scheduler can retry `killTask` operation to handle
nasty docker bugs that lead to hanging docker cli.


Diffs
-

  src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 


Diff: https://reviews.apache.org/r/65713/diff/2/


Testing
---

internal CI

Manual testing:
1. Build docker from sources: 
http://oyvindsk.com/writing/docker-build-from-source
2. Modify `ContainerInspect` function from `docker/inspect.go`:
```
 func (daemon *Daemon) ContainerInspect(name string, size bool, version string) 
(interface{}, error) {
+   time.Sleep(10 * time.Second)
```
3. Modify `ContainerStop` function from `docker/stop.go`:
```
 func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
+   rand.Seed(time.Now().UTC().UnixNano())
+   if rand.Intn(2) == 0 {
+   time.Sleep(20 * time.Second)
+   }
```
4. Rebuild docker: `sudo make build && sudo make binary`
5. Stop system docker daemon: `sudo service docker stop`
6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
7. Modify `src/cli/execute.cpp`:
  a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
offer.agent_id());` after 
https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
  b) Add a new method `retryKill` to `CommandScheduler`:
```
  void retryKill(const TaskID& taskId, const AgentID& agentId)
  {
killTask(taskId, agentId);
delay(Seconds(6), self(), ::retryKill, taskId, agentId);
  }
```
8. Rebuild mesos
9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
--resources="cpus:1;mem:100" 
--work_dir='/home/abudnik/mesos/build/var/agent-1' 
--containerizers="docker,mesos" --master="127.0.1.1:5050"`
11. Submit a task for the docker executor: `./src/mesos-execute 
--master="127.0.1.1:5050" --name="a" --containerizer=docker 
--docker_image="ubuntu:xenial" --command="sleep "`


Thanks,

Andrei Budnik



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-22 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/
---

(Updated Feb. 22, 2018, 11:19 p.m.)


Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
Vinod Kone.


Bugs: MESOS-8574
https://issues.apache.org/jira/browse/MESOS-8574


Repository: mesos


Description
---

Previosly, if `docker inspect` command hanged, the docker container
ended up in an unkillable state. This patch adds a timeout for inspect
command after receiving `killTask` analogically to `reaped` handler.
In addition we've added a timeout for `docker stop` command. If docker
`stop` or `inspect` command times out, we discard the related future,
thus the docker library kills previously spawned docker cli subprocess.
As a result, a scheduler can retry `killTask` operation to handle
nasty docker bugs that lead to hanging docker cli.


Diffs
-

  src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 


Diff: https://reviews.apache.org/r/65713/diff/2/


Testing
---

internal CI

Manual testing:
1. Build docker from sources: 
http://oyvindsk.com/writing/docker-build-from-source
2. Modify `ContainerInspect` function from `docker/inspect.go`:
```
 func (daemon *Daemon) ContainerInspect(name string, size bool, version string) 
(interface{}, error) {
+   time.Sleep(10 * time.Second)
```
3. Modify `ContainerStop` function from `docker/stop.go`:
```
 func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
+   rand.Seed(time.Now().UTC().UnixNano())
+   if rand.Intn(2) == 0 {
+   time.Sleep(20 * time.Second)
+   }
```
4. Rebuild docker: `sudo make build && sudo make binary`
5. Stop system docker daemon: `sudo service docker stop`
6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
7. Modify `src/cli/execute.cpp`:
  a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
offer.agent_id());` after 
https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
  b) Add a new method `retryKill` to `CommandScheduler`:
```
  void retryKill(const TaskID& taskId, const AgentID& agentId)
  {
killTask(taskId, agentId);
delay(Seconds(6), self(), ::retryKill, taskId, agentId);
  }
```
8. Rebuild mesos
9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
--resources="cpus:1;mem:100" 
--work_dir='/home/abudnik/mesos/build/var/agent-1' 
--containerizers="docker,mesos" --master="127.0.1.1:5050"`
11. Submit a task for the docker executor: `./src/mesos-execute 
--master="127.0.1.1:5050" --name="a" --containerizer=docker 
--docker_image="ubuntu:xenial" --command="sleep "`


Thanks,

Andrei Budnik



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-22 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198122
---



Patch looks great!

Reviews applied: [65713]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 
MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Feb. 20, 2018, 5:11 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 20, 2018, 5:11 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-22 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198120
---



PASS: Mesos patch 65713 was successfully built and tested.

Reviews applied: `['65713']`

All the build artifacts available at: 
http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65713

- Mesos Reviewbot Windows


On Feb. 20, 2018, 7:11 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 20, 2018, 7:11 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-21 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review197841
---




src/docker/executor.cpp
Lines 436-437 (original), 436-437 (patched)


This comment is not strictly corrected: `run.isSome() == true` only means 
that we have launched a docker cli `run` command and nothing more. Can you 
please update the comment? And maybe also leave a note how `run` and `inspect` 
relate to task states where these fields are declared? Thanks!

As a side note (no call to action), a proper state machine with exlicit 
states would be great to have here.



src/docker/executor.cpp
Lines 451-455 (patched)


So instead of checking that inspect **always** finishes after 
`DOCKER_INSPECT_TIMEOUT`, we only check that if we want to kill the container, 
either via kill or reap. I can see the reason for this (we may want to give 
`docker inspect` enough time to crank up its whale machinery if the user gives 
us this time), but maybe still print a warning? For example, hang the following 
at the inspect call site:
```
inspect
  .after(DOCKER_INSPECT_TIMEOUT, [=](const Future& f) {
LOG(WARNING) << "docker inspect has not finished after " << 
DOCKER_INSPECT_TIMEOUT;
return f;
  });



src/docker/executor.cpp
Line 509 (original), 518-519 (patched)


"Docker stop timed out after " << KILL_RETRY_INTERVAL;

Also no need to specify the container name if you follow the suggestion 
below.



src/docker/executor.cpp
Lines 523-527 (patched)


Instead of this, why not make `stop.onFailed` below `stop.recover`?


- Alexander Rukletsov


On Feb. 20, 2018, 5:11 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 20, 2018, 5:11 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/1/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> 

Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-20 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review197788
---



Patch looks great!

Reviews applied: [65713]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 
MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On Feb. 20, 2018, 5:11 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 20, 2018, 5:11 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/1/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 65713: Handled hanging docker `stop`, `inspect` commands in docker executor.

2018-02-20 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review197779
---



PASS: Mesos patch 65713 was successfully built and tested.

Reviews applied: `['65713']`

All the build artifacts available at: 
http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65713

- Mesos Reviewbot Windows


On Feb. 20, 2018, 3:18 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> ---
> 
> (Updated Feb. 20, 2018, 3:18 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, and Greg Mann.
> 
> 
> Bugs: MESOS-8574
> https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previosly, if `docker inspect` command hanged, the docker container
> ended up in an unkillable state. This patch adds a timeout for inspect
> command after receiving `killTask` analogically to `reaped` handler.
> In addition we've added a timeout for `docker stop` command. If docker
> `stop` or `inspect` command times out, we discard the related future,
> thus the docker library kills previously spawned docker cli subprocess.
> As a result, a scheduler can retry `killTask` operation to handle
> nasty docker bugs that lead to hanging docker cli.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 80e2d81169f0d4303ca1ddbcef9fa87fe52601fc 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/1/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version 
> string) (interface{}, error) {
> +   time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +   rand.Seed(time.Now().UTC().UnixNano())
> +   if rand.Intn(2) == 0 {
> +   time.Sleep(20 * time.Second)
> +   }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), ::retryKill, task->task_id(), 
> offer.agent_id());` after 
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
> killTask(taskId, agentId);
> delay(Seconds(6), self(), ::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh 
> --resources="cpus:1;mem:100" 
> --work_dir='/home/abudnik/mesos/build/var/agent-1' 
> --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute 
> --master="127.0.1.1:5050" --name="a" --containerizer=docker 
> --docker_image="ubuntu:xenial" --command="sleep "`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>