[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456573#comment-16456573 ]

Vishant Singh commented on MESOS-8574:
--

[~abudnik] I am not completely sure of the reason for the Docker hang, but it seems that Docker has stale information about running containers. The container gets killed as part of a task kill request from Marathon. Since the Docker task kill involves SIGTERM first and then SIGKILL after a timeout, the SIGKILL terminates the task, but dockerd is not updated with this state. This might be because SIGKILL cannot be caught, so no signal handler runs that could eventually update the state information in Docker. After this, when a new task is launched on this host, docker inspect or docker ps becomes unresponsive. At this point I have monitoring in place for Docker hangs, and the idea is to restart Docker if it is in a hung state.

> Docker executor makes no progress when 'docker inspect' hangs
> -
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
> Issue Type: Improvement
> Components: docker, executor
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Andrei Budnik
> Priority: Major
> Labels: mesosphere
> Fix For: 1.3.3, 1.4.2, 1.5.1, 1.6.0
>
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.
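For the monitoring idea mentioned in the comment, a rough sketch of a hang probe might look like the following. It is not taken from Mesos or from the commenter's setup: the `probe` function, the `ProbeResult` enum, and the choice of probe command and timeout are all assumptions made for illustration. It runs a command in a child process, polls for its exit, and reports `Hung` if the command does not finish in time; restarting Docker is left to the caller.

```cpp
#include <chrono>
#include <csignal>
#include <string>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type for the sketch.
enum class ProbeResult { Ok, Failed, Hung };

// Runs `command` (e.g. {"docker", "ps"}, an assumed probe) in a child process
// and reports `Hung` if it has not exited within `timeout`.
ProbeResult probe(
    const std::vector<std::string>& command,
    std::chrono::milliseconds timeout)
{
  if (command.empty()) {
    return ProbeResult::Failed;
  }

  // Build a null-terminated argv for execvp().
  std::vector<char*> argv;
  for (const std::string& arg : command) {
    argv.push_back(const_cast<char*>(arg.c_str()));
  }
  argv.push_back(nullptr);

  pid_t pid = fork();
  if (pid < 0) {
    return ProbeResult::Failed;
  }

  if (pid == 0) {
    execvp(argv[0], argv.data());
    _exit(127); // Only reached if exec failed.
  }

  const auto deadline = std::chrono::steady_clock::now() + timeout;
  int status = 0;

  while (true) {
    pid_t reaped = waitpid(pid, &status, WNOHANG);
    if (reaped == pid) {
      bool ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
      return ok ? ProbeResult::Ok : ProbeResult::Failed;
    }
    if (reaped < 0) {
      return ProbeResult::Failed;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      break;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }

  // The probe did not finish in time: kill it and reap it to avoid a zombie.
  kill(pid, SIGKILL);
  waitpid(pid, &status, 0);
  return ProbeResult::Hung;
}
```

A watchdog could call `probe({"docker", "ps"}, std::chrono::seconds(30))` periodically and treat `Hung` as the trigger for a restart; both the command and the 30 second value are placeholders.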
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435779#comment-16435779 ]

Andrei Budnik commented on MESOS-8574:
--

[~vishant.si...@gmail.com] The root cause for a hanging Docker daemon or Docker CLI is unknown. We would appreciate any thoughts on what might be the reason for the issue in your case.
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435767#comment-16435767 ]

Andrei Budnik commented on MESOS-8574:
--

[~vishant.si...@gmail.com] The issue was with a hanging Docker CLI that is used in the built-in Docker executor. We've solved it by:
1) automatically retrying `docker inspect` every `DOCKER_INSPECT_TIMEOUT` seconds in [the Docker executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L220-L251] and in the [Docker containerizer|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/slave/containerizer/docker.cpp#L1717-L1753];
2) making the `killTask` command retryable for [the Docker executor|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L543-L565]. If a scheduler (e.g. Marathon) retries `killTask`, then we abort the previous (possibly hanging) `docker stop` command and start a new `docker stop`.

> do we need docker inspect just for the pid of the newly launched container?
> wondering if there is an alternative to 'docker inspect' if its just pid.

We call `docker inspect` not only to get the pid of a container. We have to wait until the Docker daemon marks the container as `RUNNING` before sending a `TASK_RUNNING` status update to the scheduler; otherwise we [might detect a container termination|https://github.com/apache/mesos/blob/99c73e0c4b0bee67790d98650e064843fef3933c/src/docker/executor.cpp#L253-L257] and send a terminal status update (e.g. `TASK_FINISHED`) before sending `TASK_RUNNING`. So we wait for `docker inspect` to guarantee the correct order of task status updates.
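To illustrate the retry pattern from point 1, here is a minimal, self-contained sketch in standard C++. It is not the Mesos implementation, which is built on libprocess futures; `Attempt`, `InspectFn`, `inspectWithRetry`, and the `maxAttempts` bound are hypothetical names and parameters introduced for this example only. Each attempt runs on its own thread; if it does not finish within the timeout, it is flagged as discarded and a fresh attempt is started.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <utility>

// State shared between one inspect attempt and the caller waiting on it.
struct Attempt {
  std::mutex mutex;
  std::condition_variable done;
  bool finished = false;
  std::string result;
  std::atomic<bool> discarded{false}; // Set when the caller gives up on it.
};

// Hypothetical stand-in for one `docker inspect` invocation. It receives the
// attempt index and a flag it should poll to notice that it was discarded.
using InspectFn =
  std::function<std::string(int, const std::atomic<bool>&)>;

// Starts a new attempt every `timeout` until one finishes in time. The bound
// `maxAttempts` is an assumption that keeps the sketch finite.
std::optional<std::string> inspectWithRetry(
    const InspectFn& inspect,
    std::chrono::milliseconds timeout,
    int maxAttempts)
{
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    auto state = std::make_shared<Attempt>();

    // The thread owns a reference to `state`, so a late (discarded) attempt
    // can still finish safely after the caller has moved on.
    std::thread([state, inspect, attempt]() {
      std::string result = inspect(attempt, state->discarded);
      std::lock_guard<std::mutex> lock(state->mutex);
      state->result = std::move(result);
      state->finished = true;
      state->done.notify_one();
    }).detach();

    std::unique_lock<std::mutex> lock(state->mutex);
    bool inTime = state->done.wait_for(
        lock, timeout, [&state] { return state->finished; });

    if (inTime) {
      return state->result;
    }

    // Timed out: discard this attempt and start a new one.
    state->discarded = true;
  }

  return std::nullopt;
}
```

The same shape would apply to the retryable `killTask` in point 2: a new kill request discards the previous, possibly hanging, stop attempt before starting another.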
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435702#comment-16435702 ]

Vishant Singh commented on MESOS-8574:
--

[~abudnik] [~greggomann] After going through all the comments, I am a bit confused. Are we adding a timeout for docker inspect/stop? Or are we depending on task termination from the scheduler after "task_launch_timeout"?
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435648#comment-16435648 ]

Vishant Singh commented on MESOS-8574:
--

[~greggomann] I am trying to understand this issue, as we hit it very often. Do we need docker inspect just for the pid of the newly launched container? I am wondering whether there is an alternative to 'docker inspect' if it is needed just for the pid.
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367927#comment-16367927 ]

Greg Mann commented on MESOS-8574:
--

Had a few discussions offline with people today about this issue, and I am now thinking that we do not need to add a timeout for the Docker executor's initial {{docker inspect}} call. Rather, we can delegate task termination to the scheduler. If the scheduler does not receive any status updates for a task for a while, it can kill it. Marathon, for example, has the {{task_launch_timeout}} for this purpose: https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:
1) Docker executor runs its task via {{Docker::run()}}, and then makes its initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it will discard the {{Future}} from the aforementioned {{inspect()}} call, send a {{TASK_KILLING}} status update, and then call {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, there are two possibilities:
  a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which case we do not have a PID for the container. Thus, all we can do is retry {{Docker::stop()}} and continue attempting to kill.
  b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case we have the container's PID, and we use {{os::killtree()}} to directly kill the container with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We could also consider making steps #1 and #2 more robust in the face of transient Docker CLI issues by discarding/retrying the {{inspect()}} and {{stop()}} calls after some interval.
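The branching in step 3 can be written down as a small decision function. This is only a sketch of the proposal as worded above, not code from the Mesos tree: `KillAction` and `afterGracePeriod` are hypothetical names, and the container PID is modelled as an optional integer that is empty when the initial inspect never succeeded.

```cpp
#include <optional>

// Hypothetical outcome of the check performed once the grace period expires.
enum class KillAction {
  Done,      // Docker::stop() succeeded; nothing more to do.
  RetryStop, // No PID known (case 3a): retry Docker::stop().
  KillTree   // PID known (case 3b): SIGKILL the process tree directly,
             // then send TASK_KILLED.
};

// `containerPid` is empty if the initial inspect call never returned.
KillAction afterGracePeriod(
    bool stopSucceeded,
    const std::optional<int>& containerPid)
{
  if (stopSucceeded) {
    return KillAction::Done;
  }

  if (!containerPid.has_value()) {
    return KillAction::RetryStop;
  }

  return KillAction::KillTree;
}
```

Keeping the decision separate from the side effects makes the two cases easy to check in isolation, for example `afterGracePeriod(false, std::nullopt)` yields `KillAction::RetryStop`.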
[jira] [Commented] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361964#comment-16361964 ]

Greg Mann commented on MESOS-8574:
--

Based on discussions offline today, we started to converge on the following approach:
* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to enforce a timeout, after which the Docker executor calls {{docker->stop}} in an attempt to kill the container, if it's running.
* The executor registration timeout may be a logical choice for the duration of this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a TASK_KILLING update is sent to the scheduler. This will allow the scheduler to retry the {{docker stop}} call by sending KILL calls. In such a case, we should discard any previously-made call to {{docker->stop}}.
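The timeout idea in the first bullet can be approximated with standard C++ futures. Note the swap: libprocess's {{.after()}} registers a non-blocking callback, whereas this sketch blocks in `std::future::wait_for`, so it only mirrors the control flow. `inspectOrKill` is a hypothetical helper, and the `killTask` callback stands in for the executor's kill path.

```cpp
#include <chrono>
#include <functional>
#include <future>

// Waits up to `timeout` for the inspect future. If it is not ready in time
// (a deferred future also counts as not ready), invokes `killTask` so the
// caller can send TASK_KILLING and attempt to stop the container.
// Returns true if the inspect result is available.
template <typename T>
bool inspectOrKill(
    std::future<T>& inspect,
    std::chrono::milliseconds timeout,
    const std::function<void()>& killTask)
{
  if (inspect.wait_for(timeout) == std::future_status::ready) {
    return true;
  }

  killTask();
  return false;
}
```

Routing the timeout through the kill callback, rather than stopping the container inline, matches the third bullet: the scheduler sees TASK_KILLING and can retry the kill.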