Travis Hegner created MESOS-4581:
------------------------------------

             Summary: mesos-docker-executor has a race condition causing docker 
tasks to be stuck in staging when trying to launch
                 Key: MESOS-4581
                 URL: https://issues.apache.org/jira/browse/MESOS-4581
             Project: Mesos
          Issue Type: Bug
          Components: containerization, docker
    Affects Versions: 0.26.0, 0.27.0
         Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
            Reporter: Travis Hegner


We are still working to understand the root cause of this issue, but here is 
what we know so far:

Symptoms:

Launching Docker containers from Marathon on Mesos results in the Marathon app 
being stuck in a "staged" status, and the Mesos task being stuck in a 
"staging" status, until the launch times out, at which point the task is 
launched on another host with roughly a 50/50 chance of running or getting 
stuck in staging again.

We have a lot of custom containers, custom networking configs, and custom 
docker run parameters, but we can't narrow this down to any one particular 
aspect of our environment. It happens randomly per Marathon app while the app 
is attempting to start or restart an instance, whether the app's config has 
changed or not. I can't find anyone else reporting a similar issue, which 
leads me to believe that it takes a combination of aspects of our environment 
to trigger this race condition.

Deeper analysis:
The mesos-docker-executor fires the "docker run ..." command in a future. It 
simultaneously (for all intents and purposes) fires a "docker inspect" against 
the container it is trying to start. When we see this bug, the container 
starts normally, but the docker inspect command hangs forever: it never 
retries and never times out.
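
To make the timing concrete, here is a rough standalone illustration of that 
ordering (my own sketch, not the executor's code; the container name and image 
are made up). Nothing orders the inspect relative to the run, so the inspect 
can reach the docker daemon at any point during the run:

{code}
// Standalone timing illustration only -- not the mesos-docker-executor code.
// It launches "docker run" and "docker inspect" at effectively the same time,
// so the inspect can reach the daemon anywhere between the run's internal
// "create", "attach", and "start" API calls.
#include <cstdlib>
#include <future>
#include <iostream>
#include <string>

int main()
{
  // Hypothetical container name and image, used only for this illustration.
  const std::string name = "race-demo";

  // The executor fires the run asynchronously...
  std::future<int> run = std::async(std::launch::async, [&name]() {
    return std::system(
        ("docker run --rm --name=" + name + " alpine sleep 30").c_str());
  });

  // ...and, for all intents and purposes simultaneously, fires an inspect
  // against the container it is trying to start. Nothing serializes this
  // call with respect to the run's "create"/"attach"/"start".
  std::future<int> inspect = std::async(std::launch::async, [&name]() {
    return std::system(("docker inspect " + name).c_str());
  });

  std::cout << "inspect exited with " << inspect.get() << std::endl;
  std::cout << "run exited with " << run.get() << std::endl;
  return 0;
}
{code}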

When the task launches successfully, the docker inspect command fails once 
with a non-zero exit code, is retried 500ms later, and succeeds, flagging the 
task as "RUNNING" in both Mesos and Marathon at the same time.

If you watch the docker log, you'll notice that a "docker run" via the command 
line actually triggers three docker API calls in succession: "create", 
"attach", and "start", in that order. It has been fairly consistent that when 
we see this bug triggered, the docker log shows the "create" from the run 
command, then the GET for the inspect command, and only later the "attach" and 
"start". When the launch works, we see the GET first (failing, of course, 
because the container doesn't exist yet), followed by the "create", "attach", 
and "start".

Rudimentary Solution:
We have written a very basic patch which uses ".after()" to delay the initial 
inspect call on the container until at least one DOCKER_INSPECT_DELAY (500ms) 
after the docker run command is issued. This has eliminated the bug as far as 
we can tell.
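
For illustration only, the effect of the patch is roughly the following 
standalone sketch (my own approximation; the real change uses ".after()" on 
the executor's libprocess futures rather than a sleep, and the helper and 
container names here are made up):

{code}
// Sketch of the idea behind the patch, in standalone form (not the actual
// libprocess-based change): wait one DOCKER_INSPECT_DELAY after issuing
// "docker run" before the first inspect, then keep retrying on failure.
#include <chrono>
#include <cstdlib>
#include <future>
#include <string>
#include <thread>

// Mirrors the 500ms DOCKER_INSPECT_DELAY constant mentioned above.
const std::chrono::milliseconds kInspectDelay(500);

// Hypothetical helper: the first inspect is delayed by one interval, and
// failed inspects are retried at the same interval (the healthy path).
int inspectWithInitialDelay(const std::string& name)
{
  // Delay the *first* inspect so it cannot land between the run command's
  // "create" and its "attach"/"start" calls.
  std::this_thread::sleep_for(kInspectDelay);

  int status = -1;
  while ((status = std::system(("docker inspect " + name).c_str())) != 0) {
    // Fail, wait 500ms, retry -- same as the successful case described above.
    std::this_thread::sleep_for(kInspectDelay);
  }
  return status;
}

int main()
{
  const std::string name = "race-demo";  // made-up container name

  std::future<int> run = std::async(std::launch::async, [&name]() {
    return std::system(
        ("docker run --rm --name=" + name + " alpine sleep 30").c_str());
  });

  std::future<int> inspect =
      std::async(std::launch::async, inspectWithInitialDelay, name);

  inspect.get();
  return run.get();
}
{code}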

I am not sure whether this one-time initial delay is the most appropriate fix, 
or whether it would be better to add a timeout to the inspect call in the 
mesos-docker-executor that destroys the current inspect thread and starts a 
new one. The timeout/retry may be appropriate whether or not the initial delay 
exists.
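
A timeout/retry on the inspect might look roughly like the following 
standalone sketch (again my own illustration, not proposed Mesos code; the 
5-second timeout is an arbitrary assumption). Each attempt that hangs past the 
timeout is simply abandoned and a fresh inspect is issued:

{code}
// Standalone sketch of the alternative: put a timeout on each inspect attempt
// and issue a fresh one if it hangs. A real fix would discard the hung
// libprocess future inside the executor rather than use std::future.
#include <chrono>
#include <cstdlib>
#include <future>
#include <memory>
#include <string>
#include <thread>

// One inspect attempt on a detached thread; the returned future can be
// abandoned without blocking if the attempt hangs.
std::future<int> asyncInspect(const std::string& name)
{
  auto result = std::make_shared<std::promise<int>>();
  std::future<int> future = result->get_future();

  std::thread([result, name]() {
    result->set_value(std::system(("docker inspect " + name).c_str()));
  }).detach();

  return future;
}

int inspectWithTimeout(const std::string& name)
{
  const auto timeout = std::chrono::seconds(5);            // assumed value
  const auto retryDelay = std::chrono::milliseconds(500);  // DOCKER_INSPECT_DELAY

  while (true) {
    std::future<int> attempt = asyncInspect(name);

    if (attempt.wait_for(timeout) == std::future_status::ready &&
        attempt.get() == 0) {
      return 0;  // inspect succeeded
    }

    // Either the inspect failed quickly (container not created yet) or it
    // hung past the timeout and is abandoned here; wait and try again.
    std::this_thread::sleep_for(retryDelay);
  }
}

int main()
{
  return inspectWithTimeout("race-demo");  // made-up container name
}
{code}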

In Summary:
It appears that mesos-docker-executor does not have a race condition itself, 
but rather seems to be triggering one in docker. Since we haven't found this 
issue reported anywhere else with any substance, we understand that it is 
likely related to our environment. Our custom network driver for docker does 
some cluster-wide coordination, and may introduce just enough delay between 
the "create" and "attach" calls to make us hit this bug on roughly 50-60% of 
attempted container starts.

The inspectDelay patch that I've written for this issue is located in my 
inspectDelay branch at:
https://github.com/travishegner/mesos/tree/inspectDelay

I am happy to submit this patch as a pull request, or to put it through the 
review board, if the maintainers feel it is an appropriate fix, or at least a 
stop-gap measure until a better fix can be written.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
