Steven Schlansker created MESOS-5195:
----------------------------------------

             Summary: Docker executor: task logs lost on shutdown
                 Key: MESOS-5195
                 URL: https://issues.apache.org/jira/browse/MESOS-5195
             Project: Mesos
          Issue Type: Bug
          Components: containerization, docker
    Affects Versions: 0.27.2
         Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
            Reporter: Steven Schlansker


When you kill a task running under the Docker executor (in our case via 
Singularity), the task shuts down cleanly, but the last lines written to 
standard out / standard error are lost during teardown.

For example, we run dumb-init.  With debugging on, you can see it should write:
{noformat}
DEBUG("Forwarded signal %d to children.\n", signum);
{noformat}

If you attach strace to the process, you can see that it clearly writes the 
text to stderr.  But that message is lost and is never written to the sandbox 
'stderr' file.

We believe the issue starts here, in Docker executor.cpp:

{code}
  void shutdown(ExecutorDriver* driver)
  {
    cout << "Shutting down" << endl;

    if (run.isSome() && !killed) {
      // The docker daemon might still be in progress starting the
      // container, therefore we kill both the docker run process
      // and also ask the daemon to stop the container.

      // Making a mutable copy of the future so we can call discard.
      Future<Nothing>(run.get()).discard();
      stop = docker->stop(containerName, stopTimeout);
      killed = true;
    }
  }
{code}

Notice how the "run" future is discarded *before* the Docker daemon is told to 
stop -- now what will discarding it do?

{code}
void commandDiscarded(const Subprocess& s, const string& cmd)
{
  VLOG(1) << "'" << cmd << "' is being discarded";
  os::killtree(s.pid(), SIGKILL);
}
{code}

Oops, just sent SIGKILL to the entire process tree...
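
To illustrate why this matters, here is a minimal standalone sketch in plain 
POSIX C++ (not Mesos code; 'outputAfterSignal' and 'onTerm' are names made up 
for this example).  It forks a child that emits one last line when it receives 
SIGTERM, and shows that the same child emits nothing when it is sent SIGKILL, 
because SIGKILL cannot be caught.  I am assuming the lost sandbox output has the 
same root cause: whatever is relaying the final output gets killed before it 
can write.

```cpp
#include <cassert>
#include <signal.h>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {
int g_out = -1;  // Write end of the pipe, used by the signal handler.

// Async-signal-safe handler: emit one last line, then exit.
void onTerm(int) {
  const char msg[] = "final log line\n";
  ssize_t ignored = write(g_out, msg, sizeof(msg) - 1);
  (void)ignored;
  _exit(0);
}
}  // namespace

// Hypothetical helper: forks a child that logs a final line on SIGTERM,
// sends it `sig`, and returns everything the child wrote after the signal.
std::string outputAfterSignal(int sig) {
  assert(sig > 0);
  int fds[2];
  if (pipe(fds) != 0) return "<pipe failed>";

  pid_t pid = fork();
  if (pid < 0) return "<fork failed>";

  if (pid == 0) {
    // Child: install the handler, tell the parent we are ready, then idle.
    close(fds[0]);
    g_out = fds[1];
    signal(SIGTERM, onTerm);
    if (write(fds[1], "R", 1) != 1) _exit(1);
    for (;;) pause();
  }

  // Parent: wait for the ready byte so the handler is known to be installed.
  close(fds[1]);
  char ready = 0;
  if (read(fds[0], &ready, 1) != 1) {
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return "<child failed>";
  }

  kill(pid, sig);

  // Collect whatever the child manages to write before it goes away.
  std::string out;
  char buf[64];
  ssize_t n;
  while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
    out.append(buf, static_cast<size_t>(n));
  }
  close(fds[0]);
  waitpid(pid, nullptr, 0);
  return out;
}
```

Calling outputAfterSignal(SIGTERM) returns the final line, while 
outputAfterSignal(SIGKILL) returns an empty string.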

There is another (harmless?) side effect visible in the Docker daemon logs.  
The daemon never gets a chance to kill the task:

{noformat}
ERROR Handler for DELETE /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 returned error: No such container: mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
{noformat}

I suspect that the fix is to wait for 'docker->stop()' to complete before 
discarding the 'run' future.
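
A rough sketch of the ordering I have in mind follows.  This is not a patch and 
it does not use the libprocess API: 'Hooks', 'shutdownStopFirst' and 
'traceShutdown' are hypothetical stand-ins, with 'stop' modelling 
'docker->stop()' and 'discardRun' modelling the discard of the 'run' future.  
The only point is that the discard is chained behind completion of the stop 
instead of being issued first.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the two actions in shutdown(); not Mesos API.
struct Hooks {
  // Requests a graceful stop and invokes the continuation once it completes.
  std::function<void(std::function<void()>)> stop;
  // Discards the run (which, in the real executor, ends in a SIGKILL).
  std::function<void()> discardRun;
};

// Proposed ordering: ask for the stop first, discard only after it is done.
void shutdownStopFirst(const Hooks& hooks) {
  std::function<void()> discard = hooks.discardRun;  // Copy for safe capture.
  hooks.stop([discard] { discard(); });
}

// Small simulation that records the order in which things happen.
std::vector<std::string> traceShutdown() {
  std::vector<std::string> events;
  std::function<void()> pending;  // Continuation parked until the stop ends.

  Hooks hooks;
  hooks.stop = [&events, &pending](std::function<void()> onStopped) {
    events.push_back("stop requested");
    pending = std::move(onStopped);
  };
  hooks.discardRun = [&events] { events.push_back("run discarded"); };

  shutdownStopFirst(hooks);
  assert(pending);  // The discard must be waiting, not already executed.

  // Simulate the container exiting gracefully, then fire the continuation.
  events.push_back("container exited");
  pending();
  return events;
}
```

In this simulation traceShutdown() yields "stop requested", "container exited", 
"run discarded", in that order, so the container gets to exit (and its output to 
be written) before anything is killed.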

Happy to provide more information if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
