Hi Tyson,

Glad you figured it out; sorry, I didn't realize you were running the mesos slave in a Docker container (which certainly complicates things).
I have a series of patches pending to be merged that will also make task recovery work when relaunching mesos-slave in a Docker container. Currently, even with --pid=host, when your slave dies your tasks are not able to recover when it restarts.

Tim

On Sat, Apr 18, 2015 at 10:32 PM, Tyson Norris <[email protected]> wrote:
> Yes, this was the problem - sorry for the noise.
>
> For the record, running mesos-slave in a container requires the "--pid=host" option, as mentioned in MESOS-2183.
>
> Now if docker-compose would just get released with support for setting the pid flag, life would be easy...
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 9:48 PM, Tyson Norris <[email protected]> wrote:
>
> I think I may be running into this:
> https://issues.apache.org/jira/browse/MESOS-2183
>
> I'm trying to get docker-compose to launch the slave with --pid=host, but I'm having a few separate problems with that.
>
> I will update this thread when I'm able to test that.
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 1:14 PM, Tyson Norris <[email protected]> wrote:
>
> Hi Tim - Actually, rereading your email: "For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead." it sounds like you are suggesting running the container as a task command? But part of what I'm doing is providing a custom executor, so I think what I had before is appropriate - eventually I want the tasks to launch the same way (i.e. similar to the existing mesos-storm framework), but I am trying to launch the executor as a container instead of a script command, which I think should be possible.
>
> So maybe you can comment on using a container within an ExecutorInfo as below?
> Docs here:
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L267
> suggest that ContainerInfo and CommandInfo should be provided - I am using setShell(false) to avoid changing the entry point, which already uses the default "/bin/sh -c".
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 1:03 PM, Tyson Norris <[email protected]> wrote:
>
> Hi Tim -
> I am using my own framework - a modified version of mesos-storm, attempting to use docker containers instead of
>
> TaskInfo is like:
>
>     TaskInfo task = TaskInfo.newBuilder()
>             .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
>             .setTaskId(taskId)
>             .setSlaveId(offer.getSlaveId())
>             .setExecutor(ExecutorInfo.newBuilder()
>                     .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
>                     .setData(ByteString.copyFromUtf8(executorDataStr))
>                     .setCommand(CommandInfo.newBuilder()
>                             .setShell(false)
>                     )
>                     .setContainer(ContainerInfo.newBuilder()
>                             .setType(ContainerInfo.Type.DOCKER)
>                             .setDocker(ContainerInfo.DockerInfo.newBuilder()
>                                     .setImage("testexecutor")
>                             )
>                     )
>
> I understand this test image will be expected to fail - I expect it to fail by registration timeout, and not by simply dying, though. I'm only using a test image because I see the same behavior with my actual image, which properly handles the mesos-executor registration protocol.
>
> I will try moving the Container inside the Command, and see if it survives longer.
>
> I see now at
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675
> it mentions "Either ExecutorInfo or CommandInfo should be set"
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 12:38 PM, Tim Chen <[email protected]> wrote:
>
> That does seem odd - how did you run this via mesos? Are you using your own framework, or going through another framework like Marathon?
>
> And what does the TaskInfo look like?
>
> Also note that if you're just testing a container, you don't want to set the ExecutorInfo with a command, as executors in Mesos are expected to communicate back to the Mesos slave and implement the protocol between mesos and executor. For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead.
>
> Tim
>
> On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris <[email protected]> wrote:
>
>> Hi Tim -
>> Yes, I mentioned below when using a script like:
>> --------------------------------------
>> #!/bin/bash
>> until false; do
>>     echo "waiting for something to do something"
>>     sleep 0.2
>> done
>> --------------------------------------
>>
>> In my sandbox stdout I get exactly 2 lines:
>> waiting for something to do something
>> waiting for something to do something
>>
>> Running this container any other way, e.g. docker run --rm -it testexecutor, the output is an endless stream of "waiting for something to do something".
>>
>> So something is stopping the container, as opposed to the container just exiting; at least that's how it looks - I only see the container stop when it is launched as an executor.
>>
>> Also, based on the docker logs, something is calling the /container/id/stop endpoint *before* the /container/id/logs endpoint - so the stop arrives before the logs are tailed, which also seems incorrect, and suggests that some code is explicitly stopping the container, rather than the container exiting on its own.
>>
>> Thanks
>> Tyson
>>
>> On Apr 18, 2015, at 3:33 AM, Tim Chen <[email protected]> wrote:
>>
>> Hi Tyson,
>>
>> The error message you saw in the logs about the executor exiting actually just means the executor process has exited.
>>
>> Since you're launching a custom executor with MesosSupervisor, it seems like MesosSupervisor simply exited without reporting any task status.
>>
>> Can you look at the actual logs of the container? They can be found in the sandbox stdout and stderr logs.
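[The truncated output described above can be reproduced off-cluster by running the same loop under a hard 0.5-second cutoff; the sketch below uses coreutils `timeout` as a stand-in for whatever is stopping the container, and is only a local simulation, not the slave's actual mechanism.]

```shell
# Run the test image's loop script under a ~0.5 s cutoff.
# At one echo per 0.2 s, only the first few iterations make it to
# stdout before the process is killed, matching the short sandbox log.
timeout 0.5 sh -c '
  until false; do
    echo "waiting for something to do something"
    sleep 0.2
  done
' | wc -l
```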
>>
>> Tim
>>
>> On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris <[email protected]> wrote:
>>
>>> The sequence I see in the docker.log when my executor is launched is something like:
>>> GET /containers/id/json
>>> POST /containers/id/wait
>>> POST /containers/id/stop
>>> GET /containers/id/logs
>>>
>>> So I'm wondering if the slave is calling docker->stop out of order in slave/containerizer/docker.cpp
>>> I only see it being called in recover and destroy, and I don't see logs indicating either of those happening, but I may be missing something else.
>>>
>>> Tyson
>>>
>>> On Apr 17, 2015, at 9:42 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> mesos master INFO log says:
>>> I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to framework 20150411-165219-771756460-5050-1-0000 (marathon) at [email protected]:44364
>>> I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for offers: [ 20150418-041001-553718188-5050-1-O165 ] on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) for framework 20150411-165219-771756460-5050-1-0000 (marathon) at [email protected]:44364
>>> I0418 04:26:31.580369 9 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000] (total allocatable: cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000]) on slave 20150418-041001-553718188-5050-1-S0 from framework 20150411-165219-771756460-5050-1-0000
>>> I0418 04:26:32.480036 12 master.cpp:3388] Executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) terminated with signal Unknown signal 127
>>>
>>> mesos slave INFO log says:
>>> I0418 04:26:31.390650 8 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150418-041001-553718188-5050-1-0001
>>> I0418 04:26:31.392432 8 slave.cpp:4160] Launching executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 in work directory '/tmp/mesos/slaves/20150418-041001-553718188-5050-1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:31.392587 8 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429330829 of framework '20150418-041001-553718188-5050-1-0001
>>> I0418 04:26:31.397415 7 docker.cpp:755] Starting container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' for executor 'insights-1-1429330829' and framework '20150418-041001-553718188-5050-1-0001'
>>> I0418 04:26:31.397835 7 fetcher.cpp:238] Fetching URIs using command '/usr/libexec/mesos/mesos-fetcher'
>>> I0418 04:26:32.177479 11 docker.cpp:1333] Executor for container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' has exited
>>> I0418 04:26:32.177817 11 docker.cpp:1159] Destroying container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:32.177999 11 docker.cpp:1248] Running docker stop on container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:32.177620 6 slave.cpp:3135] Monitoring executor 'insights-1-1429330829' of framework '20150418-041001-553718188-5050-1-0001' in container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:32.477990 12 slave.cpp:3186] Executor 'insights-1-1429330829' of framework 20150418-041001-553718188-5050-1-0001 has terminated with unknown status
>>> I0418 04:26:32.479394 12 slave.cpp:2508] Handling status update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001 from @0.0.0.0:0
>>> W0418 04:26:32.479645 12 docker.cpp:841] Ignoring updating unknown container: 3cc411b0-c2e0-41ae-80c2-f0306371da5a
>>> I0418 04:26:32.480041 10 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001
>>> I0418 04:26:32.481073 12 slave.cpp:2753] Forwarding the update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001 to [email protected]:5050
>>>
>>> docker.log says:
>>> time="2015-04-18T04:26:31Z" level=debug msg="Calling POST /containers/create"
>>> time="2015-04-18T04:26:31Z" level=info msg="POST /v1.18/containers/create?name=mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job create(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job log(create, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:31Z" level=info msg="-job log(create, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:31Z" level=info msg="-job create(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:31Z" level=debug msg="Calling POST /containers/{name:.*}/start"
>>> time="2015-04-18T04:26:31Z" level=info msg="POST /v1.18/containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/start"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:31Z" level=info msg="-job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:31Z" level=debug msg="Calling GET /containers/{name:.*}/json"
>>> time="2015-04-18T04:26:31Z" level=info msg="GET /containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/json"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling POST /containers/{name:.*}/wait"
>>> time="2015-04-18T04:26:32Z" level=info msg="POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/wait"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/logs"
>>> time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/logs?follow=1&stderr=1&stdout=1&tail=all"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling POST /containers/{name:.*}/stop"
>>> time="2015-04-18T04:26:32Z" level=info msg="POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/stop?t=0"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Sending 15 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4"
>>> time="2015-04-18T04:26:32Z" level=info msg="Container 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4 failed to exit within 0 seconds of SIGTERM - using the force"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Sending 9 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>>
>>> I don't see a syslog for the master/slave containers.
>>>
>>> Thanks
>>> Tyson
>>>
>>> On Apr 17, 2015, at 7:07 PM, Jason Giedymin <[email protected]> wrote:
>>>
>>> What do any/all logs say? (syslog)
>>>
>>> -Jason
>>>
>>> On Apr 17, 2015, at 7:22 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> Another interesting fact: I can restart the docker container of my executor, and it runs great.
>>>
>>> In the test example below, notice that stdout appears to be growing as expected after restarting the container.
>>>
>>> So something is killing my executor container (also indicated by the "Exited (137) About a minute ago"), but I'm still not sure what.
>>>
>>> Thanks
>>> Tyson
>>>
>>> tnorris-osx:insights tnorris$ docker ps -a | grep testexec
>>> 5291fe29c9c2   testexecutor:latest   "/bin/sh -c executor   About a minute ago   Exited (137) About a minute ago   mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> tnorris-osx:insights tnorris$ docker start mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> tnorris-osx:insights tnorris$ docker logs mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> tnorris-osx:insights tnorris$ docker stop mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>>
>>> On Apr 17, 2015, at 2:11 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> You can reproduce this with almost any dockerfile, I think - it seems like launching a custom executor that is a docker container has some problem.
>>>
>>> I just made a simple test with this dockerfile:
>>> --------------------------------------
>>> # this is oracle java8 atop phusion baseimage
>>> FROM opentable/baseimage-java8:latest
>>>
>>> # mesos lib (not used here, but will be in our "real" executor, e.g. to register the executor, etc.)
>>> RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E56151BF
>>> RUN echo "deb http://repos.mesosphere.io/$(lsb_release -is | tr '[:upper:]' '[:lower:]') $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/mesosphere.list
>>> RUN cat /etc/apt/sources.list.d/mesosphere.list
>>> RUN apt-get update && apt-get install -y \
>>>     mesos
>>>
>>> ADD script.sh /usr/bin/executor-script.sh
>>>
>>> CMD executor-script.sh
>>> --------------------------------------
>>>
>>> and script.sh:
>>> --------------------------------------
>>> #!/bin/bash
>>> until false; do
>>>     echo "waiting for something to do something"
>>>     sleep 0.2
>>> done
>>> --------------------------------------
>>>
>>> And in my stdout I get exactly 2 lines:
>>> waiting for something to do something
>>> waiting for something to do something
>>>
>>> Which is how many lines can be output within 0.5 seconds... something is fishy about the 0.5 seconds, but I'm not sure where.
>>>
>>> I'm not sure exactly what the difference is, but launching a docker container as a task WITHOUT a custom executor works fine, and I'm not sure about launching a docker container as a task that uses a non-docker custom executor. The case I'm trying for is using a docker custom executor and launching non-docker tasks (in case that helps clarify the situation).
>>>
>>> Thanks
>>> Tyson
>>>
>>> On Apr 17, 2015, at 1:47 PM, Jason Giedymin <[email protected]> wrote:
>>>
>>> Try:
>>>
>>> until <something>; do
>>>     echo "waiting for something to do something"
>>>     sleep 5
>>> done
>>>
>>> You can put this in a bash file and run that.
>>>
>>> If you have a dockerfile, it would be easier to debug.
>>>
>>> -Jason
>>>
>>> On Apr 17, 2015, at 4:24 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> Yes, agreed that the command should not exit - but the container is killed around 0.5 s after launch regardless of whether the command terminates, which is why I've been experimenting with commands that have varied exit times.
>>>
>>> For example, forget about the executor needing to register momentarily.
>>>
>>> Using the command:
>>> echo testing123c && sleep 0.1 && echo testing456c
>>> -> I see the expected output in stdout, and the container is destroyed (as expected), because the container exits quickly and is then destroyed.
>>>
>>> Using the command:
>>> echo testing123d && sleep 0.6 && echo testing456d
>>> -> I do NOT see the expected output in stdout (I only get testing123d), because the container is destroyed prematurely after ~0.5 seconds.
>>>
>>> Using the "real" storm command, I get no output in stdout, probably because no output is generated within 0.5 seconds of launch - it is a bit of a pig to start up, so I'm currently just trying to execute some other commands for testing purposes.
>>>
>>> So I'm guessing this is a timeout issue, or else the container is reaped inappropriately, or something else... Looking through this code, I'm trying to figure out the steps taken during executor launch:
>>> https://github.com/apache/mesos/blob/00318fc1b30fc0961c2dfa4d934c37866577d801/src/slave/containerizer/docker.cpp#L715
>>>
>>> Thanks
>>> Tyson
>>>
>>> On Apr 17, 2015, at 12:53 PM, Jason Giedymin <[email protected]> wrote:
>>>
>>> What is the last command you have docker doing?
>>>
>>> If that command exits, then docker will begin to end the container.
>>>
>>> -Jason
>>>
>>> On Apr 17, 2015, at 3:23 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> Hi -
>>> I am looking at revving the mesos-storm framework to be dockerized (and simpler).
>>> I'm using mesos 0.22.0-1.0.ubuntu1404.
>>> mesos master + mesos slave are deployed in docker containers, in case it matters.
>>>
>>> I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor.
>>>
>>> For example:
>>>
>>>     TaskInfo task = TaskInfo.newBuilder()
>>>             .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
>>>             .setTaskId(taskId)
>>>             .setSlaveId(offer.getSlaveId())
>>>             .setExecutor(ExecutorInfo.newBuilder()
>>>                     .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
>>>                     .setData(ByteString.copyFromUtf8(executorDataStr))
>>>                     .setContainer(ContainerInfo.newBuilder()
>>>                             .setType(ContainerInfo.Type.DOCKER)
>>>                             .setDocker(ContainerInfo.DockerInfo.newBuilder()
>>>                                     .setImage("mesos-storm")))
>>>                     .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor"))
>>>     // rest is unchanged from existing mesos-storm framework code
>>>
>>> The executor launches and exits quickly - see the log msg: Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited
>>>
>>> It seems like mesos loses track of the executor? I understand there is a 1 minute timeout on registering the executor, but the exit happens well before 1 minute.
>>>
>>> I tried a few alternate commands to experiment, and I can see in the stdout for the task that
>>> "echo testing123 && echo testing456"
>>> prints to stdout correctly, both testing123 and testing456
>>>
>>> however:
>>> "echo testing123a && sleep 10 && echo testing456a"
>>> prints only testing123a, presumably because the container is lost and destroyed before the sleep time is up.
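[The two experiments above can be mimicked off-cluster, with coreutils `timeout` standing in for whatever is stopping the container at ~0.5 s; this is only a local simulation of the symptom, not the slave's actual mechanism.]

```shell
# A command that finishes within the ~0.5 s window emits all its output:
timeout 0.5 sh -c 'echo testing123 && echo testing456'

# A command that outlives the window is killed mid-flight, so only the
# first echo reaches stdout; timeout itself exits with status 124:
timeout 0.5 sh -c 'echo testing123a && sleep 10 && echo testing456a'
```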
>>>
>>> So it's like the container for the executor is only allowed to run for 0.5 seconds; then it is detected as exited, and the task is lost.
>>>
>>> Thanks for any advice.
>>>
>>> Tyson
>>>
>>> slave logs look like:
>>> mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
>>> mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
>>> mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor insights-1-1429297638 of framework 20150417-190611-2801799596-5050-1-0000 in work directory '/tmp/mesos/slaves/20150417-190611-2801799596-5050-1-S0/frameworks/20150417-190611-2801799596-5050-1-0000/executors/insights-1-1429297638/runs/6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:27.463444 11 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429297638 of framework '20150417-190611-2801799596-5050-1-0000
>>> mesosslave_1 | I0417 19:07:27.467200 7 docker.cpp:755] Starting container '6539127f-9dbb-425b-86a8-845b748f0cd3' for executor 'insights-1-1429297638' and framework '20150417-190611-2801799596-5050-1-0000'
>>> mesosslave_1 | I0417 19:07:27.985935 7 docker.cpp:1333] Executor for container '6539127f-9dbb-425b-86a8-845b748f0cd3' has exited
>>> mesosslave_1 | I0417 19:07:27.986359 7 docker.cpp:1159] Destroying container '6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:27.986021 9 slave.cpp:3135] Monitoring executor 'insights-1-1429297638' of framework '20150417-190611-2801799596-5050-1-0000' in container '6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:27.986464 7 docker.cpp:1248] Running docker stop on container '6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:28.286761 10 slave.cpp:3186] Executor 'insights-1-1429297638' of framework 20150417-190611-2801799596-5050-1-0000 has terminated with unknown status
>>> mesosslave_1 | I0417 19:07:28.288784 10 slave.cpp:2508] Handling status update TASK_LOST (UUID: 0795a58b-f487-42e2-aaa1-a26fe6834ed7) for task mesos-slave1.service.consul-31000 of framework 20150417-190611-2801799596-5050-1-0000 from @0.0.0.0:0
>>> mesosslave_1 | W0417 19:07:28.289227 9 docker.cpp:841] Ignoring updating unknown container: 6539127f-9dbb-425b-86a8-845b748f0cd3
>>>
>>> nimbus logs (framework) look like:
>>> 2015-04-17T19:07:28.302+0000 s.m.MesosNimbus [INFO] Received status update: task_id {
>>>   value: "mesos-slave1.service.consul-31000"
>>> }
>>> state: TASK_LOST
>>> message: "Container terminated"
>>> slave_id {
>>>   value: "20150417-190611-2801799596-5050-1-S0"
>>> }
>>> timestamp: 1.429297648286981E9
>>> source: SOURCE_SLAVE
>>> reason: REASON_EXECUTOR_TERMINATED
>>> 11: "\a\225\245\213\364\207B\342\252\241\242o\346\203N\327"
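[One detail worth noting about the "Exited (137)" status seen earlier in the thread: 137 = 128 + 9, i.e. the process inside the container was killed with SIGKILL. That is consistent with docker.log's "failed to exit within 0 seconds of SIGTERM - using the force" line, since the stop was issued with t=0. A quick sketch of the exit-status convention:]

```shell
# Shell convention: a process killed by signal N reports exit status
# 128 + N. SIGKILL is 9, so a force-killed container shows 137.
sh -c 'kill -KILL $$'
echo "exit status: $?"   # prints: exit status: 137
```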

