[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370135#comment-16370135 ]

Andrei Budnik edited comment on MESOS-8574 at 2/22/18 8:32 PM:
---------------------------------------------------------------

https://reviews.apache.org/r/65713/
https://reviews.apache.org/r/65759/

was (Author: abudnik):
[https://reviews.apache.org/r/65713/ https://reviews.apache.org/r/65759/|https://reviews.apache.org/r/65713/]

> Docker executor makes no progress when 'docker inspect' hangs
> --------------------------------------------------------------
>
>                 Key: MESOS-8574
>                 URL: https://issues.apache.org/jira/browse/MESOS-8574
>             Project: Mesos
>          Issue Type: Improvement
>          Components: docker, executor
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: mesosphere
>
> In the Docker executor, many calls later in the executor's lifecycle are
> gated on an initial {{docker inspect}} call returning:
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
>
> If that first call to {{docker inspect}} never returns, the executor becomes
> stuck in a state where it makes no progress and cannot be killed.
>
> It's tempting for the executor to simply commit suicide after a timeout, but
> we must be careful of the case in which the executor's Docker container is
> actually running successfully, but the Docker daemon is unresponsive. In such
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's
> container is running successfully.
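To make the failure mode in the description concrete, here is a minimal libprocess-style sketch of the gating pattern; this is illustrative only (the real code is in src/docker/executor.cpp at the link above), and the names used are assumptions:

{code:cpp}
// Illustrative sketch, not the executor's actual code: later lifecycle
// steps are chained onto the future returned by the initial inspect, so
// if that future never transitions, no continuation ever runs.
process::Future<Docker::Container> inspect =
  docker->inspect(containerName);

inspect
  .then(process::defer(self(), [=](const Docker::Container& container) {
    // Record container.pid, send TASK_RUNNING, etc.
    return Nothing();
  }));

// A kill request arriving while 'inspect' is pending has nothing to act
// on: the executor holds no PID, and the kill path waits on this future.
{code}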
[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367927#comment-16367927 ]

Greg Mann edited comment on MESOS-8574 at 2/16/18 10:19 PM:
------------------------------------------------------------

Had a few discussions offline with people today about this issue, and I am now thinking that we do not need to add a timeout for the Docker executor's initial {{docker inspect}} call. Rather, we can delegate task termination to the scheduler: if the scheduler does not receive any status updates for a task for a while, it can kill the task. Marathon, for example, has the {{task_launch_timeout}} flag for this purpose:
https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md

I would propose the following behavior:

1) The Docker executor runs its task via {{Docker::run()}}, and then makes its initial {{Docker::inspect()}} call.
2) If the Docker executor receives a {{KillTaskMessage}} later on, it discards the {{Future}} from the aforementioned {{inspect()}} call, sends a {{TASK_KILLING}} status update, and then calls {{Docker::stop()}}.
3) If the {{Docker::stop()}} call has not succeeded after the {{gracePeriod}}, there are two possibilities:
   a) The initial {{Docker::inspect()}} call from #1 never succeeded, in which case we do not have a PID for the container. Thus, all we can do is retry {{Docker::stop()}} and continue attempting to kill.
   b) The initial {{Docker::inspect()}} call from #1 did succeed, in which case we have the container's PID; we use {{os::killtree()}} to kill the container directly with a {{SIGKILL}}, and then send a {{TASK_KILLED}} status update.

We can also make steps #1 and #2 more robust in the face of transient Docker CLI issues by discarding and retrying the {{inspect()}} and {{stop()}} calls after some interval.

I think that the steps outlined above constitute the highest-priority fixes and will provide the greatest improvement; adding retries of inspect/stop will increase the number of scenarios from which we successfully recover, without requiring operator intervention.
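A rough libprocess-style sketch of the kill path in steps 2) and 3) above, under stated assumptions: the {{inspect}}/{{stopped}} members, {{containerPid}}, and the {{sendStatusUpdate}} helper are hypothetical, while {{Future::discard()}}, {{process::delay()}}, and {{os::killtree()}} are existing primitives. This is not the patch under review:

{code:cpp}
// Hypothetical sketch of steps 2) and 3); not the actual patch.
// Assumed members: inspect, stopped (Futures), containerPid (Option<pid_t>),
// and a sendStatusUpdate() helper.
void DockerExecutorProcess::killTask(const TaskID& taskId)
{
  // 2) Stop waiting on the (possibly hung) initial inspect.
  inspect.discard();

  sendStatusUpdate(taskId, TASK_KILLING);

  stopped = docker->stop(containerName, stopTimeout);

  // 3) If the stop has not completed after the grace period, escalate.
  process::delay(
      gracePeriod, self(), &DockerExecutorProcess::escalated, taskId);
}

void DockerExecutorProcess::escalated(const TaskID& taskId)
{
  if (stopped.isReady()) {
    return; // `docker stop` succeeded; TASK_KILLED is sent elsewhere.
  }

  if (containerPid.isSome()) {
    // 3b) Inspect succeeded earlier, so we know the PID: kill directly.
    os::killtree(containerPid.get(), SIGKILL);
    sendStatusUpdate(taskId, TASK_KILLED);
  } else {
    // 3a) No PID available; all we can do is retry `docker stop`.
    stopped.discard();
    stopped = docker->stop(containerName, stopTimeout);
    process::delay(
        gracePeriod, self(), &DockerExecutorProcess::escalated, taskId);
  }
}
{code}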
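For the retry idea in the comment above (discarding/retrying the {{inspect()}} and {{stop()}} calls after some interval), a one-line sketch; it assumes {{Docker::inspect()}} accepts an optional retry interval that re-issues the CLI call when it does not complete in time, so treat the exact signature as an assumption to verify against the tree:

{code:cpp}
// Sketch: retry the initial inspect so a single wedged `docker inspect`
// child process does not block the executor forever. If no retryInterval
// overload exists, the same effect can be had by discarding and re-calling
// inspect() from an .after() callback.
process::Future<Docker::Container> inspect =
  docker->inspect(containerName, Seconds(5)); // re-issue inspect every 5s
{code}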
[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361964#comment-16361964 ]

Greg Mann edited comment on MESOS-8574 at 2/16/18 9:03 PM:
-----------------------------------------------------------

Based on discussions offline today, we started to converge on the following approach:

* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to enforce a timeout, after which the Docker executor calls {{docker->stop}} in an attempt to kill the container, if it is running.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a TASK_KILLING update is sent to the scheduler. This allows the scheduler to retry the {{docker stop}} call by sending KILL calls. In that case, we should discard any previously-made call to {{docker->stop}} (see MESOS-8575 for work needed to make this discard effective).
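A minimal sketch of the first bullet, bounding the inspect call with {{Future::after()}}; the {{inspectTimeout}} value and the hand-off into the kill path are assumptions:

{code:cpp}
// Hypothetical: bound the initial inspect with Future::after(). If the
// future is still pending when the timeout fires, discard it and let the
// kill path (docker->stop via killTask) take over.
docker->inspect(containerName)
  .after(inspectTimeout, [](process::Future<Docker::Container> f) {
    f.discard(); // give up on the hung inspect
    return process::Failure("Timed out waiting on 'docker inspect'");
  });
{code}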
[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs
[ https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361964#comment-16361964 ]

Greg Mann edited comment on MESOS-8574 at 2/13/18 8:11 AM:
-----------------------------------------------------------

Based on discussions offline today, we started to converge on the following approach:

* Use {{.after()}} on the {{Future}} returned by {{docker->inspect()}} to enforce a timeout, after which the Docker executor calls {{docker->stop}} in an attempt to kill the container, if it is running.
* The executor registration timeout may be a logical choice for the duration of this timeout.
* The {{docker->stop}} call should be performed via {{killTask()}} so that a TASK_KILLING update is sent to the scheduler. This allows the scheduler to retry the {{docker stop}} call by sending KILL calls. In that case, we should discard any previously-made call to {{docker->stop}} (see MESOS-8575 for work needed to make this discard effective).
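The last bullet (discarding any previously-made {{docker->stop}} when the scheduler retries the KILL) might look roughly like this; the {{stopped}} member and {{stopTimeout}} are assumptions, and MESOS-8575 tracks making the discard actually terminate the underlying CLI subprocess:

{code:cpp}
// Hypothetical sketch: on a retried KILL from the scheduler, abandon the
// in-flight `docker stop` and issue a fresh one. Per MESOS-8575, discard()
// must also kill the underlying Docker CLI subprocess to be effective.
if (stopped.isPending()) {
  stopped.discard();
}

stopped = docker->stop(containerName, stopTimeout);
{code}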