Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-02-03 Thread Qian Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219470
---


Ship it!




Ship It!

- Qian Zhang


On Feb. 3, 2020, 9:35 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Feb. 3, 2020, 9:35 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/3/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-02-03 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/
---

(Updated Фев. 3, 2020, 1:35 п.п.)


Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
Kone.


Bugs: MESOS-9847
https://issues.apache.org/jira/browse/MESOS-9847


Repository: mesos


Description
---

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.


Diffs (updated)
-

  src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
  src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 


Diff: https://reviews.apache.org/r/72055/diff/3/

Changes: https://reviews.apache.org/r/72055/diff/2-3/


Testing
---

internal CI


Thanks,

Andrei Budnik



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-31 Thread Qian Zhang


> On Jan. 30, 2020, 10:31 a.m., Qian Zhang wrote:
> > src/docker/executor.cpp
> > Lines 779-788 (original), 779-789 (patched)
> > 
> >
> > This seems a bit redundant to me, I'd suggest to only have:
> > ```
> >   void _stop()
> >   {
> > driver.get()->stop();
> >   }
> > ```
> > 
> > And then call `delay(Seconds(60), self(), ::__stop);` in 
> > `_reaped()` and `launchTask()`.
> 
> Andrei Budnik wrote:
> I agree, but ideally we'd need to introduce a named constant for 
> `Seconds(60)` (since it appears more than once in the code). I noticed that 
> there is no such named constant for other executors.

Hmm, I am not sure if we want to introduce a constant for `Seconds(60)`, and I 
think that is a separate concern (it can be handled later in a separate patch 
if we decide to do it).


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219431
---


On Jan. 30, 2020, 12:23 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Jan. 30, 2020, 12:23 a.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-31 Thread Qian Zhang


> On Jan. 29, 2020, 6:28 p.m., Qian Zhang wrote:
> > The commit message seems not accurate to me:
> > > This could lead to termination of the executor before receiving all 
> > > status update acknowledgments from the agent.
> > 
> > I think the issue that we wanted to mitigate is, executor may shutdown 
> > itself before the terminal status update (rather than the acks) is sent to 
> > agent.
> 
> Andrei Budnik wrote:
> Updated the description.
> 
> Qian Zhang wrote:
> > This could lead to termination of the executor before processing of a 
> terminal status update by the agent.
> 
> What do you mean for `before processing of a terminal status update by 
> the agent`? Executor processes terminal status update sent by the agent? I 
> think it should be `before the terminal status update is sent to the agent`.
> 
> Andrei Budnik wrote:
> In the case of MESOS-9847, a terminal status update acknowledgment was 
> delivered to the agent, but the executor had been terminated before the agent 
> processed the status update in the `Slave::statusUpdate`.

OK, then I think we need to mention both of the two cases in the commit message.
1. Executor terminates before it sends terminal status update to agent. This 
may lead to a wrong terminal status update.
2. Executor terminates before agent finishes processing the terminal status 
update. This may lead to two terminal status updates.


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219410
---


On Jan. 30, 2020, 12:23 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Jan. 30, 2020, 12:23 a.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 29, 2020, 10:28 д.п., Qian Zhang wrote:
> > The commit message seems not accurate to me:
> > > This could lead to termination of the executor before receiving all 
> > > status update acknowledgments from the agent.
> > 
> > I think the issue that we wanted to mitigate is, executor may shutdown 
> > itself before the terminal status update (rather than the acks) is sent to 
> > agent.
> 
> Andrei Budnik wrote:
> Updated the description.
> 
> Qian Zhang wrote:
> > This could lead to termination of the executor before processing of a 
> terminal status update by the agent.
> 
> What do you mean for `before processing of a terminal status update by 
> the agent`? Executor processes terminal status update sent by the agent? I 
> think it should be `before the terminal status update is sent to the agent`.

In the case of MESOS-9847, a terminal status update acknowledgment was 
delivered to the agent, but the executor had been terminated before the agent 
processed the status update in the `Slave::statusUpdate`.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219410
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 30, 2020, 2:31 д.п., Qian Zhang wrote:
> > src/docker/executor.cpp
> > Lines 779-788 (original), 779-789 (patched)
> > 
> >
> > This seems a bit redundant to me, I'd suggest to only have:
> > ```
> >   void _stop()
> >   {
> > driver.get()->stop();
> >   }
> > ```
> > 
> > And then call `delay(Seconds(60), self(), ::__stop);` in 
> > `_reaped()` and `launchTask()`.

I agree, but ideally we'd need to introduce a named constant for `Seconds(60)` 
(since it appears more than once in the code). I noticed that there is no such 
named constant for other executors.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219431
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 30, 2020, 12:18 д.п., Greg Mann wrote:
> > src/docker/executor.cpp
> > Lines 781-783 (original), 781-783 (patched)
> > 
> >
> > Is there a reason to have a failsafe here, but not in the default 
> > executor?

I added a fail-safe in the default executor too. Please see the latest version 
of `/r/72029/`.


> On Янв. 30, 2020, 12:18 д.п., Greg Mann wrote:
> > src/exec/exec.cpp
> > Lines 412-417 (patched)
> > 
> >
> > This seems like it's addressing a separate issue? If so, could we split 
> > it out into a separate patch?

The only reason I added this check is that we need to guard the condition 
statement below. Also, I doubt if this is an issue at all. So I think that it 
make sense to have them together in a single patch.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219427
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-29 Thread Qian Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219431
---




src/docker/executor.cpp
Lines 779-788 (original), 779-789 (patched)


This seems a bit redundant to me, I'd suggest to only have:
```
  void _stop()
  {
driver.get()->stop();
  }
```

And then call `delay(Seconds(60), self(), ::__stop);` in `_reaped()` 
and `launchTask()`.


- Qian Zhang


On Jan. 30, 2020, 12:23 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Jan. 30, 2020, 12:23 a.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-29 Thread Qian Zhang


> On Jan. 29, 2020, 6:28 p.m., Qian Zhang wrote:
> > The commit message seems not accurate to me:
> > > This could lead to termination of the executor before receiving all 
> > > status update acknowledgments from the agent.
> > 
> > I think the issue that we wanted to mitigate is, executor may shutdown 
> > itself before the terminal status update (rather than the acks) is sent to 
> > agent.
> 
> Andrei Budnik wrote:
> Updated the description.

> This could lead to termination of the executor before processing of a 
> terminal status update by the agent.

What do you mean for `before processing of a terminal status update by the 
agent`? Executor processes terminal status update sent by the agent? I think it 
should be `before the terminal status update is sent to the agent`.


- Qian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219410
---


On Jan. 30, 2020, 12:23 a.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Jan. 30, 2020, 12:23 a.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-29 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219427
---




src/docker/executor.cpp
Lines 781-783 (original), 781-783 (patched)


Is there a reason to have a failsafe here, but not in the default executor?



src/exec/exec.cpp
Lines 412-417 (patched)


This seems like it's addressing a separate issue? If so, could we split it 
out into a separate patch?


- Greg Mann


On Jan. 29, 2020, 4:23 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Jan. 29, 2020, 4:23 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-29 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219424
---



Patch looks great!

Reviews applied: [72055]

Passed command: export OS='ubuntu:16.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers 
--disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; 
./support/jenkins/buildbot.sh

- Mesos Reviewbot


On Jan. 29, 2020, 4:23 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Jan. 29, 2020, 4:23 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-29 Thread Andrei Budnik


> On Янв. 29, 2020, 10:28 д.п., Qian Zhang wrote:
> > The commit message seems not accurate to me:
> > > This could lead to termination of the executor before receiving all 
> > > status update acknowledgments from the agent.
> > 
> > I think the issue that we wanted to mitigate is, executor may shutdown 
> > itself before the terminal status update (rather than the acks) is sent to 
> > agent.

Updated the description.


> On Янв. 29, 2020, 10:28 д.п., Qian Zhang wrote:
> > src/docker/executor.cpp
> > Lines 786-787 (original)
> > 
> >
> > We have a fail safe in command executor: 
> > https://github.com/apache/mesos/blob/1.9.0/src/launcher/executor.cpp#L1060:L1062
> >  , do we want do the similar in Docker executor to ensure it can still self 
> > terminate in case the agent doesn't send an ACK for the terminal update for 
> > some reason?
> 
> Vinod Kone wrote:
> let's add the fail safe please.

I added a `delay` for 60 seconds before calling `driver->stop` as the fail-safe.


> On Янв. 29, 2020, 10:28 д.п., Qian Zhang wrote:
> > src/exec/exec.cpp
> > Lines 435 (patched)
> > 
> >
> > Do we want a `return;` after this code?

We don't need to. I want to make sure that corresponding objects are removed 
from `updates` and `tasks` in any case.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219410
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-29 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/
---

(Updated Янв. 29, 2020, 4:23 п.п.)


Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
Kone.


Changes
---

addressed Qian's comments


Summary (updated)
-

Changed termination logic of the Docker executor.


Bugs: MESOS-9847
https://issues.apache.org/jira/browse/MESOS-9847


Repository: mesos


Description (updated)
---

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.


Diffs (updated)
-

  src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
  src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 


Diff: https://reviews.apache.org/r/72055/diff/2/

Changes: https://reviews.apache.org/r/72055/diff/1-2/


Testing
---

internal CI


Thanks,

Andrei Budnik