[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-22 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370135#comment-16370135
 ] 

Andrei Budnik edited comment on MESOS-8574 at 2/22/18 8:32 PM:
---

https://reviews.apache.org/r/65713/
https://reviews.apache.org/r/65759/


was (Author: abudnik):
[https://reviews.apache.org/r/65713/
https://reviews.apache.org/r/65759/
|https://reviews.apache.org/r/65713/]

> Docker executor makes no progress when 'docker inspect' hangs
> -------------------------------------------------------------
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker, executor
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> In the Docker executor, many calls later in the executor's lifecycle are 
> gated on an initial {{docker inspect}} call returning: 
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes 
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but 
> we must be careful of the case in which the executor's Docker container is 
> actually running successfully, but the Docker daemon is unresponsive. In such 
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's 
> container is running successfully.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367927#comment-16367927
 ] 

Greg Mann edited comment on MESOS-8574 at 2/16/18 10:19 PM:
---

Had a few discussions offline with people today about this issue, and I am now 
thinking that we do not need to add a timeout for the Docker executor's initial 
{{docker inspect}} call. Rather, we can delegate task termination to the 
scheduler. If the scheduler does not receive any status updates for a task for 
a while, it can kill it. Marathon, for example, has the {{task_launch_timeout}} 
for this purpose: 
https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:
1) Docker executor runs its task via {{Docker::run()}}, and then makes its 
initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it will 
discard the {{Future}} from the aforementioned {{inspect()}} call, send a 
{{TASK_KILLING}} status update, and then call {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, 
there are two possibilities:
--a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which 
case we do not have a PID for the container. Thus, all we can do is retry 
{{Docker::stop()}} and continue attempting to kill.
--b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case 
we have the container's PID, and we use {{os::killtree()}} to directly kill the 
container with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We can also make steps #1 and #2 more robust in the face of transient Docker 
CLI issues by discarding/retrying the {{inspect()}} and {{stop()}} calls after 
some interval. I think that the steps outlined above constitute the 
highest-priority fixes which will provide the greatest improvement, and adding 
retries of inspect/stop will increase the number of scenarios from which we 
successfully recover, without requiring operator intervention.


was (Author: greggomann):
Had a few discussions offline with people today about this issue, and I am now 
thinking that we do not need to add a timeout for the Docker executor's initial 
{{docker inspect}} call. Rather, we can delegate task termination to the 
scheduler. If the scheduler does not receive any status updates for a task for 
a while, it can kill it. Marathon, for example, has the {{task_launch_timeout}} 
for this purpose: 
https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:
1) Docker executor runs its task via {{Docker::run()}}, and then makes its 
initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it will 
discard the {{Future}} from the aforementioned {{inspect()}} call, send a 
{{TASK_KILLING}} status update, and then call {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, 
there are two possibilities:
--a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which 
case we do not have a PID for the container. Thus, all we can do is retry 
{{Docker::stop()}} and continue attempting to kill.
--b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case 
we have the container's PID, and we use {{os::killtree()}} to directly kill the 
container with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We could also consider making steps #1 and #2 more robust in the face of 
transient Docker CLI issues by discarding/retrying the {{inspect()}} and 
{{stop()}} calls after some interval.


[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367927#comment-16367927
 ] 

Greg Mann edited comment on MESOS-8574 at 2/16/18 9:54 PM:
---

Had a few discussions offline with people today about this issue, and I am now 
thinking that we do not need to add a timeout for the Docker executor's initial 
{{docker inspect}} call. Rather, we can delegate task termination to the 
scheduler. If the scheduler does not receive any status updates for a task for 
a while, it can kill it. Marathon, for example, has the {{task_launch_timeout}} 
for this purpose: 
https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:
1) Docker executor runs its task via {{Docker::run()}}, and then makes its 
initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it will 
discard the {{Future}} from the aforementioned {{inspect()}} call, send a 
{{TASK_KILLING}} status update, and then call {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, 
there are two possibilities:
--a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which 
case we do not have a PID for the container. Thus, all we can do is retry 
{{Docker::stop()}} and continue attempting to kill.
--b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case 
we have the container's PID, and we use {{os::killtree()}} to directly kill the 
container with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We could also consider making steps #1 and #2 more robust in the face of 
transient Docker CLI issues by discarding/retrying the {{inspect()}} and 
{{stop()}} calls after some interval.


was (Author: greggomann):
Had a few discussions offline with people today about this issue, and I am now 
thinking that we do not need to add a timeout for the Docker executor's initial 
{{docker inspect}} call. Rather, we can delegate task termination to the 
scheduler. If the scheduler does not receive any status updates for a task for 
a while, it can kill it. Marathon, for example, has the {{task_launch_timeout}} 
for this purpose: 
https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:
1) Docker executor runs its task via {{Docker::run()}}, and then makes its 
initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it will 
discard the {{Future}} from the aforementioned {{inspect()}} call, send a 
{{TASK_KILLING}} status update, and then call {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, 
there are two possibilities:
  a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which 
case we do not have a PID for the container. Thus, all we can do is retry 
{{Docker::stop()}} and continue attempting to kill.
  b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case 
we have the container's PID, and we use {{os::killtree()}} to directly kill the 
container with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We could also consider making steps #1 and #2 more robust in the face of 
transient Docker CLI issues by discarding/retrying the {{inspect()}} and 
{{stop()}} calls after some interval.



[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361964#comment-16361964
 ] 

Greg Mann edited comment on MESOS-8574 at 2/16/18 9:03 PM:
---

Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}} (see MESOS-8575 for work 
needed to make this discard effective).


was (Author: greggomann):
Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of 
this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}} (see MESOS-8575 for work 
needed to make this discard effective).



[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-13 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361964#comment-16361964
 ] 

Greg Mann edited comment on MESOS-8574 at 2/13/18 8:11 AM:
---

Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of 
this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}} (see MESOS-8575 for work 
needed to make this discard effective).


was (Author: greggomann):
Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of 
this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}}.
