Repository: mesos
Updated Branches:
  refs/heads/1.3.x 699f80dc3 -> 487c41fec
Reaped the container process directly in Docker executor.

Due to a Docker issue (https://github.com/moby/moby/issues/33820), the
Docker daemon can fail to catch a container exit, i.e., the container
process has already exited but `docker ps` still shows the container as
running. As a result, the `docker run` command that we execute in the
Docker executor never returns. The `docker stop` command also has no
effect, i.e., it returns without error but `docker ps` still shows the
container as running, so the task gets stuck in the `TASK_KILLING` state.

To work around this Docker issue, this patch makes the Docker executor
reap the container process directly, so that the executor is notified
once the container process exits.

Review: https://reviews.apache.org/r/65518

Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/8d1de046
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/8d1de046
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/8d1de046

Branch: refs/heads/1.3.x
Commit: 8d1de046d99ee1e81a68b2e7cf3b7754563f060c
Parents: 699f80d
Author: Qian Zhang <zhq527...@gmail.com>
Authored: Mon Feb 5 20:42:07 2018 +0800
Committer: Gilbert Song <songzihao1...@gmail.com>
Committed: Fri Mar 2 16:57:30 2018 -0800

----------------------------------------------------------------------
 src/docker/executor.cpp | 60 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/mesos/blob/8d1de046/src/docker/executor.cpp
----------------------------------------------------------------------
diff --git a/src/docker/executor.cpp b/src/docker/executor.cpp
index 79cd4e5..917644e 100644
--- a/src/docker/executor.cpp
+++ b/src/docker/executor.cpp
@@ -242,6 +242,47 @@ public:
           driver->sendStatusUpdate(status);
         }

+        // This is a workaround for the Docker issue below:
+        // https://github.com/moby/moby/issues/33820
+        // Due to this issue, Docker daemon can fail to catch a container exit,
+        // which will lead to the `docker run` command that we execute in this
+        // executor never returning although the container has already exited.
+        // To workaround this issue, here we reap the container process directly
+        // so we will be notified when the container exits.
+        if (container.pid.isSome()) {
+          process::reap(container.pid.get())
+            .then(defer(self(), [=](const Option<int>& status) {
+              // We cannot get the actual exit status of the container
+              // process since it is not a child process of this executor,
+              // so here `status` must be `None`.
+              CHECK_NONE(status);
+
+              // There will be a race between the method `reaped` and this
+              // lambda; ideally when the Docker issue mentioned above does
+              // not occur, `reaped` will be invoked (i.e., the `docker run`
+              // command returns) to get the actual exit status of the
+              // container, so here we wait a few seconds for `reaped` to be
+              // invoked. If `reaped` is not invoked within the timeout, that
+              // means we hit that Docker issue.
+              delay(
+                  Seconds(3),
+                  self(),
+                  &Self::reapedContainer,
+                  container.pid.get());
+
+              return Nothing();
+            }));
+        } else {
+          // This is the case that the container process has already exited,
+          // Similar to the above case, let's wait a few seconds for `reaped`
+          // to be invoked.
+          delay(
+              Seconds(3),
+              self(),
+              &Self::reapedContainer,
+              None());
+        }
+
         return Nothing();
       }));

@@ -450,8 +491,27 @@ private:
     }
   }

+  void reapedContainer(Option<pid_t> pid)
+  {
+    // Do nothing if the method `reaped` has already been invoked.
+    if (terminated) {
+      return;
+    }
+
+    // This means the Docker issue mentioned in `launchTask` occurs.
+    LOG(WARNING) << "The container process"
+                 << (pid.isSome() ? " (pid: " + stringify(pid.get()) + ")" : "")
+                 << " has exited, but Docker daemon failed to catch it.";
+
+    reaped(None());
+  }
+
   void reaped(const Future<Option<int>>& run)
   {
+    if (terminated) {
+      return;
+    }
+
     terminated = true;

     // Stop health checking the task.