Hi Tyson,

Glad you figured it out; sorry, I didn't realize you were running the mesos slave in a Docker container (which certainly complicates things).
I have a series of patches pending to be merged that will also make task recovery work when relaunching mesos-slave in a Docker container. Currently, even with --pid=host, when your slave dies your tasks are not able to recover when it restarts.

Tim

On Sat, Apr 18, 2015 at 10:32 PM, Tyson Norris <[email protected]> wrote:
> Yes, this was the problem - sorry for the noise.
>
> For the record, running mesos-slave in a container requires the "--pid=host" option, as mentioned in MESOS-2183.
>
> Now if docker-compose would just get released with support for setting the pid flag, life would be easy...
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 9:48 PM, Tyson Norris <[email protected]> wrote:
>
> I think I may be running into this:
> https://issues.apache.org/jira/browse/MESOS-2183
>
> I'm trying to get docker-compose to launch the slave with --pid=host, but I'm having a few separate problems with that.
>
> I will update this thread when I'm able to test that.
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 1:14 PM, Tyson Norris <[email protected]> wrote:
>
> Hi Tim - Actually, rereading your email: "For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead." it sounds like you are suggesting running the container as a task command? But part of what I'm doing is providing a custom executor, so I think what I had before is appropriate - eventually I want the tasks to launch the same way (i.e. similar to the existing mesos-storm framework), but I am trying to launch the executor as a container instead of a script command, which I think should be possible.
>
> So maybe you can comment on using a container within an ExecutorInfo as below?
> Docs here:
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L267
> suggest that ContainerInfo and CommandInfo should be provided - I am using setShell(false) to avoid changing the entry point, which already uses the default "/bin/sh -c".
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 1:03 PM, Tyson Norris <[email protected]> wrote:
>
> Hi Tim -
> I am using my own framework - a modified version of mesos-storm, attempting to use docker containers instead of
>
> TaskInfo is like:
>
>     TaskInfo task = TaskInfo.newBuilder()
>             .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
>             .setTaskId(taskId)
>             .setSlaveId(offer.getSlaveId())
>             .setExecutor(ExecutorInfo.newBuilder()
>                     .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
>                     .setData(ByteString.copyFromUtf8(executorDataStr))
>                     .setCommand(CommandInfo.newBuilder()
>                             .setShell(false)
>                     )
>                     .setContainer(ContainerInfo.newBuilder()
>                             .setType(ContainerInfo.Type.DOCKER)
>                             .setDocker(ContainerInfo.DockerInfo.newBuilder()
>                                     .setImage("testexecutor")
>                             )
>                     )
>
> I understand this test image will be expected to fail - I expect it to fail by registration timeout, and not by simply dying, though. I'm only using a test image because I see the same behavior with my actual image, which properly handles the mesos-executor registration protocol.
>
> I will try moving the Container inside the Command, and see if it survives longer.
>
> I see now at
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675
> it mentions "Either ExecutorInfo or CommandInfo should be set"
>
> Thanks
> Tyson
>
> On Apr 18, 2015, at 12:38 PM, Tim Chen <[email protected]> wrote:
>
> That does seem odd - how did you run this via mesos? Are you using your own framework, or going through another framework like Marathon?
>
> And what does the TaskInfo look like?
>
> Also note that if you're just testing a container, you don't want to set the ExecutorInfo with a command, as executors in Mesos are expected to communicate back to the Mesos slave and implement the protocol between mesos and executor. For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead.
>
> Tim
>
> On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris <[email protected]> wrote:
>
>> Hi Tim -
>> Yes, I mentioned below when using a script like:
>> --------------------------------------
>> #!/bin/bash
>> until false; do
>>     echo "waiting for something to do something"
>>     sleep 0.2
>> done
>> --------------------------------------
>>
>> In my sandbox stdout I get exactly 2 lines:
>> waiting for something to do something
>> waiting for something to do something
>>
>> Running this container any other way, e.g. docker run --rm -it testexecutor, the output is an endless stream of "waiting for something to do something".
>>
>> So something is stopping the container, as opposed to the container just exiting; at least that's how it looks - I only see the container stop when it is launched as an executor.
>>
>> Also, based on the docker logs, something is calling the /container/id/stop endpoint *before* the /container/id/logs endpoint - so the stop arrives before the logs are tailed, which also seems incorrect, and suggests that some code is explicitly stopping the container, rather than the container exiting on its own.
>>
>> Thanks
>> Tyson
>>
>> On Apr 18, 2015, at 3:33 AM, Tim Chen <[email protected]> wrote:
>>
>> Hi Tyson,
>>
>> The error message you saw in the logs about the executor exiting actually just means the executor process has exited.
>>
>> Since you're launching a custom executor with MesosSupervisor, it seems like MesosSupervisor simply exited without reporting any task status.
>>
>> Can you look at the actual logs of the container? They can be found in the sandbox stdout and stderr logs.
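[The truncated output described above can be reproduced off-cluster by running the same loop under a hard 0.5-second cutoff; the sketch below uses coreutils `timeout` as a stand-in for whatever is stopping the container, and is only a local simulation, not the slave's actual mechanism.]

```shell
# Run the test image's loop script under a ~0.5 s cutoff.
# At one echo per 0.2 s, only the first few iterations make it to
# stdout before the process is killed, matching the short sandbox log.
timeout 0.5 sh -c '
  until false; do
    echo "waiting for something to do something"
    sleep 0.2
  done
' | wc -l
```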
>>
>> Tim
>>
>> On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris <[email protected]> wrote:
>>
>>> The sequence I see in the docker.log when my executor is launched is something like:
>>> GET /containers/id/json
>>> POST /containers/id/wait
>>> POST /containers/id/stop
>>> GET /containers/id/logs
>>>
>>> So I'm wondering if the slave is calling docker->stop out of order in slave/containerizer/docker.cpp
>>> I only see it being called in recover and destroy, and I don't see logs indicating either of those happening, but I may be missing something else.
>>>
>>> Tyson
>>>
>>> On Apr 17, 2015, at 9:42 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> mesos master INFO log says:
>>> I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to framework 20150411-165219-771756460-5050-1-0000 (marathon) at [email protected]:44364
>>> I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for offers: [ 20150418-041001-553718188-5050-1-O165 ] on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) for framework 20150411-165219-771756460-5050-1-0000 (marathon) at [email protected]:44364
>>> I0418 04:26:31.580369 9 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000] (total allocatable: cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000]) on slave 20150418-041001-553718188-5050-1-S0 from framework 20150411-165219-771756460-5050-1-0000
>>> I0418 04:26:32.480036 12 master.cpp:3388] Executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) terminated with signal Unknown signal 127
>>>
>>> mesos slave INFO log says:
>>> I0418 04:26:31.390650 8 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150418-041001-553718188-5050-1-0001
>>> I0418 04:26:31.392432 8 slave.cpp:4160] Launching executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 in work directory '/tmp/mesos/slaves/20150418-041001-553718188-5050-1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:31.392587 8 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429330829 of framework '20150418-041001-553718188-5050-1-0001
>>> I0418 04:26:31.397415 7 docker.cpp:755] Starting container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' for executor 'insights-1-1429330829' and framework '20150418-041001-553718188-5050-1-0001'
>>> I0418 04:26:31.397835 7 fetcher.cpp:238] Fetching URIs using command '/usr/libexec/mesos/mesos-fetcher'
>>> I0418 04:26:32.177479 11 docker.cpp:1333] Executor for container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' has exited
>>> I0418 04:26:32.177817 11 docker.cpp:1159] Destroying container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:32.177999 11 docker.cpp:1248] Running docker stop on container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:32.177620 6 slave.cpp:3135] Monitoring executor 'insights-1-1429330829' of framework '20150418-041001-553718188-5050-1-0001' in container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
>>> I0418 04:26:32.477990 12 slave.cpp:3186] Executor 'insights-1-1429330829' of framework 20150418-041001-553718188-5050-1-0001 has terminated with unknown status
>>> I0418 04:26:32.479394 12 slave.cpp:2508] Handling status update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001 from @0.0.0.0:0
>>> W0418 04:26:32.479645 12 docker.cpp:841] Ignoring updating unknown container: 3cc411b0-c2e0-41ae-80c2-f0306371da5a
>>> I0418 04:26:32.480041 10 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001
>>> I0418 04:26:32.481073 12 slave.cpp:2753] Forwarding the update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001 to [email protected]:5050
>>>
>>> docker.log says:
>>> time="2015-04-18T04:26:31Z" level=debug msg="Calling POST /containers/create"
>>> time="2015-04-18T04:26:31Z" level=info msg="POST /v1.18/containers/create?name=mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job create(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job log(create, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:31Z" level=info msg="-job log(create, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:31Z" level=info msg="-job create(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:31Z" level=debug msg="Calling POST /containers/{name:.*}/start"
>>> time="2015-04-18T04:26:31Z" level=info msg="POST /v1.18/containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/start"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:31Z" level=info msg="-job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:31Z" level=debug msg="Calling GET /containers/{name:.*}/json"
>>> time="2015-04-18T04:26:31Z" level=info msg="GET /containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/json"
>>> time="2015-04-18T04:26:31Z" level=info msg="+job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling POST /containers/{name:.*}/wait"
>>> time="2015-04-18T04:26:32Z" level=info msg="POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/wait"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/logs"
>>> time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/logs?follow=1&stderr=1&stdout=1&tail=all"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Calling POST /containers/{name:.*}/stop"
>>> time="2015-04-18T04:26:32Z" level=info msg="POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/stop?t=0"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Sending 15 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4"
>>> time="2015-04-18T04:26:32Z" level=info msg="Container 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4 failed to exit within 0 seconds of SIGTERM - using the force"
>>> time="2015-04-18T04:26:32Z" level=debug msg="Sending 9 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="+job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
>>> time="2015-04-18T04:26:32Z" level=info msg="-job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
>>>
>>> I don't see a syslog for the master/slave containers.
>>>
>>> Thanks
>>> Tyson
>>>
>>> On Apr 17, 2015, at 7:07 PM, Jason Giedymin <[email protected]> wrote:
>>>
>>> What do any/all logs say? (syslog)
>>>
>>> -Jason
>>>
>>> On Apr 17, 2015, at 7:22 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> Another interesting fact: I can restart the docker container of my executor, and it runs great.
>>>
>>> In the test example below, notice that stdout appears to be growing as expected after restarting the container.
>>>
>>> So something is killing my executor container (also indicated by the "Exited (137) About a minute ago"), but I'm still not sure what.
>>>
>>> Thanks
>>> Tyson
>>>
>>> tnorris-osx:insights tnorris$ docker ps -a | grep testexec
>>> 5291fe29c9c2   testexecutor:latest   "/bin/sh -c executor   About a minute ago   Exited (137) About a minute ago   mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> tnorris-osx:insights tnorris$ docker start mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> tnorris-osx:insights tnorris$ docker logs mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> waiting for something to do something
>>> tnorris-osx:insights tnorris$ docker stop mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
>>>
>>> On Apr 17, 2015, at 2:11 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> You can reproduce this with almost any dockerfile, I think - it seems like launching a custom executor that is a docker container has some problem.
>>>
>>> I just made a simple test with this dockerfile:
>>> --------------------------------------
>>> # this is oracle java8 atop phusion baseimage
>>> FROM opentable/baseimage-java8:latest
>>>
>>> # mesos lib (not used here, but will be in our "real" executor, e.g. to register the executor, etc.)
>>> RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E56151BF
>>> RUN echo "deb http://repos.mesosphere.io/$(lsb_release -is | tr '[:upper:]' '[:lower:]') $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/mesosphere.list
>>> RUN cat /etc/apt/sources.list.d/mesosphere.list
>>> RUN apt-get update && apt-get install -y \
>>>     mesos
>>>
>>> ADD script.sh /usr/bin/executor-script.sh
>>>
>>> CMD executor-script.sh
>>> --------------------------------------
>>>
>>> and script.sh:
>>> --------------------------------------
>>> #!/bin/bash
>>> until false; do
>>>     echo "waiting for something to do something"
>>>     sleep 0.2
>>> done
>>> --------------------------------------
>>>
>>> And in my stdout I get exactly 2 lines:
>>> waiting for something to do something
>>> waiting for something to do something
>>>
>>> Which is how many lines can be output within 0.5 seconds... something is fishy about the 0.5 seconds, but I'm not sure where.
>>>
>>> I'm not sure exactly what the difference is, but launching a docker container as a task WITHOUT a custom executor works fine, and I'm not sure about launching a docker container as a task that uses a non-docker custom executor. The case I'm trying for is using a docker custom executor and launching non-docker tasks (in case that helps clarify the situation).
>>>
>>> Thanks
>>> Tyson
>>>
>>> On Apr 17, 2015, at 1:47 PM, Jason Giedymin <[email protected]> wrote:
>>>
>>> Try:
>>>
>>> until <something>; do
>>>     echo "waiting for something to do something"
>>>     sleep 5
>>> done
>>>
>>> You can put this in a bash file and run that.
>>>
>>> If you have a dockerfile, it would be easier to debug.
>>>
>>> -Jason
>>>
>>> On Apr 17, 2015, at 4:24 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> Yes, agreed that the command should not exit - but the container is killed around 0.5 s after launch regardless of whether the command terminates, which is why I've been experimenting with commands that have varied exit times.
>>>
>>> For example, forget about the executor needing to register momentarily.
>>>
>>> Using the command:
>>> echo testing123c && sleep 0.1 && echo testing456c
>>> -> I see the expected output in stdout, and the container is destroyed (as expected), because the container exits quickly and is then destroyed.
>>>
>>> Using the command:
>>> echo testing123d && sleep 0.6 && echo testing456d
>>> -> I do NOT see the expected output in stdout (I only get testing123d), because the container is destroyed prematurely after ~0.5 seconds.
>>>
>>> Using the "real" storm command, I get no output in stdout, probably because no output is generated within 0.5 seconds of launch - it is a bit of a pig to start up, so I'm currently just trying to execute some other commands for testing purposes.
>>>
>>> So I'm guessing this is a timeout issue, or else the container is reaped inappropriately, or something else... Looking through this code, I'm trying to figure out the steps taken during executor launch:
>>> https://github.com/apache/mesos/blob/00318fc1b30fc0961c2dfa4d934c37866577d801/src/slave/containerizer/docker.cpp#L715
>>>
>>> Thanks
>>> Tyson
>>>
>>> On Apr 17, 2015, at 12:53 PM, Jason Giedymin <[email protected]> wrote:
>>>
>>> What is the last command you have docker doing?
>>>
>>> If that command exits, then docker will begin to end the container.
>>>
>>> -Jason
>>>
>>> On Apr 17, 2015, at 3:23 PM, Tyson Norris <[email protected]> wrote:
>>>
>>> Hi -
>>> I am looking at revving the mesos-storm framework to be dockerized (and simpler).
>>> I'm using mesos 0.22.0-1.0.ubuntu1404.
>>> mesos master + mesos slave are deployed in docker containers, in case it matters.
>>>
>>> I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor.
>>>
>>> For example:
>>>
>>>     TaskInfo task = TaskInfo.newBuilder()
>>>             .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
>>>             .setTaskId(taskId)
>>>             .setSlaveId(offer.getSlaveId())
>>>             .setExecutor(ExecutorInfo.newBuilder()
>>>                     .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
>>>                     .setData(ByteString.copyFromUtf8(executorDataStr))
>>>                     .setContainer(ContainerInfo.newBuilder()
>>>                             .setType(ContainerInfo.Type.DOCKER)
>>>                             .setDocker(ContainerInfo.DockerInfo.newBuilder()
>>>                                     .setImage("mesos-storm")))
>>>                     .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor"))
>>>     // rest is unchanged from existing mesos-storm framework code
>>>
>>> The executor launches and exits quickly - see the log msg: Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited
>>>
>>> It seems like mesos loses track of the executor? I understand there is a 1 minute timeout on registering the executor, but the exit happens well before 1 minute.
>>>
>>> I tried a few alternate commands to experiment, and I can see in the stdout for the task that
>>> "echo testing123 && echo testing456"
>>> prints to stdout correctly, both testing123 and testing456
>>>
>>> however:
>>> "echo testing123a && sleep 10 && echo testing456a"
>>> prints only testing123a, presumably because the container is lost and destroyed before the sleep time is up.
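[The two experiments above can be mimicked off-cluster, with coreutils `timeout` standing in for whatever is stopping the container at ~0.5 s; this is only a local simulation of the symptom, not the slave's actual mechanism.]

```shell
# A command that finishes within the ~0.5 s window emits all its output:
timeout 0.5 sh -c 'echo testing123 && echo testing456'

# A command that outlives the window is killed mid-flight, so only the
# first echo reaches stdout; timeout itself exits with status 124:
timeout 0.5 sh -c 'echo testing123a && sleep 10 && echo testing456a'
```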
>>>
>>> So it's like the container for the executor is only allowed to run for 0.5 seconds; then it is detected as exited, and the task is lost.
>>>
>>> Thanks for any advice.
>>>
>>> Tyson
>>>
>>> slave logs look like:
>>> mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
>>> mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
>>> mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor insights-1-1429297638 of framework 20150417-190611-2801799596-5050-1-0000 in work directory '/tmp/mesos/slaves/20150417-190611-2801799596-5050-1-S0/frameworks/20150417-190611-2801799596-5050-1-0000/executors/insights-1-1429297638/runs/6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:27.463444 11 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429297638 of framework '20150417-190611-2801799596-5050-1-0000
>>> mesosslave_1 | I0417 19:07:27.467200 7 docker.cpp:755] Starting container '6539127f-9dbb-425b-86a8-845b748f0cd3' for executor 'insights-1-1429297638' and framework '20150417-190611-2801799596-5050-1-0000'
>>> mesosslave_1 | I0417 19:07:27.985935 7 docker.cpp:1333] Executor for container '6539127f-9dbb-425b-86a8-845b748f0cd3' has exited
>>> mesosslave_1 | I0417 19:07:27.986359 7 docker.cpp:1159] Destroying container '6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:27.986021 9 slave.cpp:3135] Monitoring executor 'insights-1-1429297638' of framework '20150417-190611-2801799596-5050-1-0000' in container '6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:27.986464 7 docker.cpp:1248] Running docker stop on container '6539127f-9dbb-425b-86a8-845b748f0cd3'
>>> mesosslave_1 | I0417 19:07:28.286761 10 slave.cpp:3186] Executor 'insights-1-1429297638' of framework 20150417-190611-2801799596-5050-1-0000 has terminated with unknown status
>>> mesosslave_1 | I0417 19:07:28.288784 10 slave.cpp:2508] Handling status update TASK_LOST (UUID: 0795a58b-f487-42e2-aaa1-a26fe6834ed7) for task mesos-slave1.service.consul-31000 of framework 20150417-190611-2801799596-5050-1-0000 from @0.0.0.0:0
>>> mesosslave_1 | W0417 19:07:28.289227 9 docker.cpp:841] Ignoring updating unknown container: 6539127f-9dbb-425b-86a8-845b748f0cd3
>>>
>>> nimbus logs (framework) look like:
>>> 2015-04-17T19:07:28.302+0000 s.m.MesosNimbus [INFO] Received status update: task_id {
>>>   value: "mesos-slave1.service.consul-31000"
>>> }
>>> state: TASK_LOST
>>> message: "Container terminated"
>>> slave_id {
>>>   value: "20150417-190611-2801799596-5050-1-S0"
>>> }
>>> timestamp: 1.429297648286981E9
>>> source: SOURCE_SLAVE
>>> reason: REASON_EXECUTOR_TERMINATED
>>> 11: "\a\225\245\213\364\207B\342\252\241\242o\346\203N\327"
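[One detail worth noting about the "Exited (137)" status seen earlier in the thread: 137 = 128 + 9, i.e. the process inside the container was killed with SIGKILL. That is consistent with docker.log's "failed to exit within 0 seconds of SIGTERM - using the force" line, since the stop was issued with t=0. A quick sketch of the exit-status convention:]

```shell
# Shell convention: a process killed by signal N reports exit status
# 128 + N. SIGKILL is 9, so a force-killed container shows 137.
sh -c 'kill -KILL $$'
echo "exit status: $?"   # prints: exit status: 137
```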

