[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-04-27 Thread Vishant Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456573#comment-16456573
 ] 

Vishant Singh commented on MESOS-8574:
--

[~abudnik]

I'm not completely sure of the reason for the Docker hang.

But it seems like Docker has stale information about running containers.

The container gets killed as part of a task kill request from Marathon. As the 
Docker task kill involves SIGTERM (first) and then SIGKILL (after a timeout), 
the SIGKILL terminates the task, but dockerd is not updated with this state. 
This might be because SIGKILL cannot be caught, so no signal handler runs that 
could eventually update the state information in Docker.
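
For illustration, the TERM-then-KILL escalation described above can be sketched in Python. This is only a sketch, not Docker's or Mesos's code: `stop_with_escalation` and `spawn_term_ignoring_child` are hypothetical names, and a POSIX host is assumed. It shows that a process which ignores SIGTERM is still terminated by SIGKILL, without any handler of its own running.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM: the process may catch it and clean up
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught, so no handler runs
        return proc.wait()


def spawn_term_ignoring_child() -> subprocess.Popen:
    """Start a child that ignores SIGTERM (hypothetical stand-in for a task)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has installed SIG_IGN
    return proc


if __name__ == "__main__":
    child = spawn_term_ignoring_child()
    exit_code = stop_with_escalation(child, grace_seconds=0.5)
    child.stdout.close()
    # On POSIX, a negative return code is the number of the terminating signal.
    print(exit_code == -signal.SIGKILL)
```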

After this, when a new task is launched on this host, docker inspect or 
docker ps becomes unresponsive.

At this point I have monitoring in place for Docker hangs, and the idea is to 
restart Docker if it is in a hung state.
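
A watchdog of this kind could look roughly like the sketch below. It is an assumption about how such monitoring might be built, not a description of the actual setup: `is_hung` and `check_and_restart` are hypothetical names, and the probe command and restart action are passed in by the caller. The restart fires only if every probe attempt times out, which avoids restarting the daemon over a single slow response.

```python
import subprocess
import sys
from typing import Callable, Sequence


def is_hung(probe_cmd: Sequence[str], timeout_seconds: float) -> bool:
    """Return True if probe_cmd does not finish within timeout_seconds."""
    try:
        # subprocess.run kills the child itself when the timeout expires.
        subprocess.run(
            probe_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
        return False
    except subprocess.TimeoutExpired:
        return True


def check_and_restart(
    probe_cmd: Sequence[str],
    restart: Callable[[], None],
    timeout_seconds: float = 10.0,
    attempts: int = 3,
) -> bool:
    """Call restart() only if every probe attempt timed out.

    Returns True if restart() was invoked.
    """
    for _ in range(attempts):
        if not is_hung(probe_cmd, timeout_seconds):
            return False
    restart()
    return True
```

On a systemd host, one might call it as `check_and_restart(["docker", "ps"], lambda: subprocess.run(["systemctl", "restart", "docker"]))`; the restart command here is an assumption about the host. Note that restarting the daemon can affect containers that are otherwise running fine.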

> Docker executor makes no progress when 'docker inspect' hangs
> -
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
> Issue Type: Improvement
> Components: docker, executor
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Major
> Labels: mesosphere
> Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are 
> gated on an initial {{docker inspect}} call returning: 
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes 
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but 
> we must be careful of the case in which the executor's Docker container is 
> actually running successfully, but the Docker daemon is unresponsive. In such 
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's 
> container is running successfully.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-04-12 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435779#comment-16435779
 ] 

Andrei Budnik commented on MESOS-8574:
--

[~vishant.si...@gmail.com] The root cause for a hanging Docker daemon or Docker 
CLI is unknown. We would appreciate any thoughts on what might be the reason 
for the issue in your case.



[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-04-12 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435767#comment-16435767
 ] 

Andrei Budnik commented on MESOS-8574:
--

[~vishant.si...@gmail.com] The issue was with a hanging Docker CLI that is used 
in the built-in Docker executor.
 We've solved it by:
 1) automatically retrying `docker inspect` every `DOCKER_INSPECT_TIMEOUT` 
seconds in [the Docker 
executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L220-L251]
 and in the [Docker 
containerizer|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/slave/containerizer/docker.cpp#L1717-L1753];
 2) making `killTask` command retry-able for [the Docker 
executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L543-L565].
 If a scheduler (e.g. Marathon) retries `killTask`, then we abort the previous 
(possibly hanging) `docker stop` command and start a new `docker stop`.
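
The retry approach in 1) can be illustrated with a small Python sketch. This is not the Mesos implementation (the executor is written in C++ on top of libprocess futures); `run_with_retries` is a hypothetical helper, the per-attempt timeout plays the role of `DOCKER_INSPECT_TIMEOUT`, and bounding the number of attempts is an assumption made to keep the example finite.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_retries(
    cmd: Sequence[str], attempt_timeout: float, max_attempts: int
) -> Optional[str]:
    """Run cmd, abandoning and restarting it whenever one attempt hangs.

    Returns the command's stdout on the first successful attempt, or None
    if no attempt finished successfully.
    """
    for _ in range(max_attempts):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=attempt_timeout
            )
        except subprocess.TimeoutExpired:
            # The hung child was killed by subprocess.run; start a new attempt.
            continue
        if result.returncode == 0:
            return result.stdout
    return None
```

A caller would pass something like `["docker", "inspect", container_name]`, where `container_name` is whatever name the task's container was started with.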

> do we need docker inspect just for the pid of the newly launched container?
 > wondering if there is an alternative to 'docker inspect' if its just pid.
 We call `docker inspect` not only to get the PID of a container.
 We have to wait until the Docker daemon marks the container as `RUNNING` before 
sending a `TASK_RUNNING` status update to a scheduler,
 otherwise we [might detect a container 
termination|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L253-L257]
 and send a terminal status update (e.g. `TASK_FINISHED`) before sending 
`TASK_RUNNING`. So we wait for `docker inspect` to guarantee a correct order of 
task status updates.



[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-04-12 Thread Vishant Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435702#comment-16435702
 ] 

Vishant Singh commented on MESOS-8574:
--

[~abudnik]

[~greggomann]

After going through all the comments, I am a bit confused.

Are we adding a timeout for docker inspect/stop?

Or

are we depending on task termination from the scheduler after "task_launch_timeout"?



[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-04-12 Thread Vishant Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435648#comment-16435648
 ] 

Vishant Singh commented on MESOS-8574:
--

[~greggomann]

I am trying to understand this issue, as we hit it very often.

Do we need docker inspect just for the PID of the newly launched container?

I am wondering if there is an alternative to 'docker inspect' if it's just the PID.

 



[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367927#comment-16367927
 ] 

Greg Mann commented on MESOS-8574:
--

Had a few discussions offline with people today about this issue, and I am now 
thinking that we do not need to add a timeout for the Docker executor's initial 
{{docker inspect}} call. Rather, we can delegate task termination to the 
scheduler. If the scheduler does not receive any status updates for a task for 
a while, it can kill it. Marathon, for example, has the {{task_launch_timeout}} 
for this purpose: 
https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:
1) Docker executor runs its task via {{Docker::run()}}, and then makes its 
initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it will 
discard the {{Future}} from the aforementioned {{inspect()}} call, send a 
{{TASK_KILLING}} status update, and then call {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, 
there are two possibilities:
  a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which 
case we do not have a PID for the container. Thus, all we can do is retry 
{{Docker::stop()}} and continue attempting to kill.
  b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case 
we have the container's PID, and we use {{os::killtree()}} to directly kill the 
container with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We could also consider making steps #1 and #2 more robust in the face of 
transient Docker CLI issues by discarding/retrying the {{inspect()}} and 
{{stop()}} calls after some interval.
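
The proposed behavior can be summarized as a small decision sketch in Python. It only models the branching in steps 2) and 3) and is not executor code: `ExecutorState`, `on_kill_task`, and `on_grace_period_expired` are hypothetical names, and the returned strings merely describe the actions the executor would take.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecutorState:
    # Set only once the initial inspect() call from step 1 has succeeded.
    container_pid: Optional[int] = None


def on_kill_task(state: ExecutorState) -> List[str]:
    """Step 2: actions taken when a KillTaskMessage arrives."""
    return [
        "discard pending inspect() future",
        "send TASK_KILLING",
        "call Docker::stop()",
    ]


def on_grace_period_expired(state: ExecutorState) -> List[str]:
    """Step 3: actions taken when stop() has not succeeded in time."""
    if state.container_pid is None:
        # Case a): no PID is known, so the only option is to retry stop().
        return ["retry Docker::stop()"]
    # Case b): the PID is known, so the container can be killed directly.
    return [
        f"os::killtree({state.container_pid}, SIGKILL)",
        "send TASK_KILLED",
    ]
```

The asymmetry between the two cases is the point: without a successful `inspect()`, the executor has no way to bypass the Docker CLI.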



[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-13 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361964#comment-16361964
 ] 

Greg Mann commented on MESOS-8574:
--

Based on discussions offline today, we started to converge on the following 
approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to 
enforce a timeout, after which the Docker executor calls {{docker->stop}} in an 
attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of 
this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a 
TASK_KILLING update is sent to the scheduler. This will allow the scheduler to 
retry the {{docker stop}} call by sending KILL calls. In such a case, we should 
discard any previously-made call to {{docker->stop}}.
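
As a rough illustration of this approach, the sketch below uses Python's `asyncio` in place of libprocess: `asyncio.wait_for` stands in for `.after()` on the `Future`, and cancelling a task stands in for discarding a previous `docker->stop` call. `KillCoordinator`, `inspect_or_kill`, and `demo` are hypothetical names, and sending `TASK_KILLING` only on the first kill is an assumption of the sketch.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class KillCoordinator:
    """Tracks the in-flight stop call so that a retry can replace it."""

    def __init__(self, stop: Callable[[], Awaitable[None]], updates: List[str]):
        self._stop = stop  # stand-in for docker->stop
        self._updates = updates  # status updates "sent" to the scheduler
        self.pending_stop: Optional[asyncio.Task] = None

    def kill_task(self) -> asyncio.Task:
        if self.pending_stop is None:
            self._updates.append("TASK_KILLING")  # assumption: sent once
        elif not self.pending_stop.done():
            self.pending_stop.cancel()  # discard the previous stop call
        self.pending_stop = asyncio.ensure_future(self._stop())
        return self.pending_stop


async def inspect_or_kill(
    inspect: Callable[[], Awaitable[str]],
    coordinator: KillCoordinator,
    timeout_seconds: float,
) -> Optional[str]:
    """Wait for inspect; if it times out, go through kill_task instead."""
    try:
        return await asyncio.wait_for(inspect(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        coordinator.kill_task()
        return None


async def demo():
    updates: List[str] = []
    stops_started: List[int] = []

    async def hanging_inspect() -> str:
        await asyncio.sleep(3600)
        return "unreachable"

    async def hanging_stop() -> None:
        stops_started.append(len(stops_started) + 1)
        await asyncio.sleep(3600)

    coordinator = KillCoordinator(hanging_stop, updates)
    result = await inspect_or_kill(hanging_inspect, coordinator, 0.05)
    first_stop = coordinator.pending_stop
    await asyncio.sleep(0.01)  # let the first stop call start
    second_stop = coordinator.kill_task()  # scheduler retries the kill
    await asyncio.sleep(0.01)  # let the cancellation take effect
    outcome = (result, first_stop.cancelled(), second_stop.done(),
               updates, stops_started)
    second_stop.cancel()
    await asyncio.gather(second_stop, return_exceptions=True)
    return outcome
```

Running `asyncio.run(demo())` exercises the timeout path with an inspect call and a stop call that both hang, followed by one kill retry.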
