mesos master INFO log says:

I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to framework 20150411-165219-771756460-5050-1-0000 (marathon) at scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for offers: [ 20150418-041001-553718188-5050-1-O165 ] on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) for framework 20150411-165219-771756460-5050-1-0000 (marathon) at scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
I0418 04:26:31.580369 9 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000] (total allocatable: cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000]) on slave 20150418-041001-553718188-5050-1-S0 from framework 20150411-165219-771756460-5050-1-0000
I0418 04:26:32.480036 12 master.cpp:3388] Executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) terminated with signal Unknown signal 127
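Two numbers in these logs are worth decoding before going further. 127 is not a valid Linux signal number - as a shell exit status it conventionally means "command not found" - and the "Exited (137)" shown for the executor container later in this thread follows the shell convention of 128+N for death by signal N. So "Unknown signal 127" may just be a raw exit status being printed through a signal formatter. A minimal bash sketch to verify those conventions (nothing here is Mesos-specific; runnable anywhere):

--------------------------------------
# 137 = 128 + 9, i.e. the container was killed with SIGKILL
echo $((137 - 128))   # -> 9
kill -l 9             # -> KILL

# 127 on its own is the shell's "command not found" status, not a signal
sh -c 'no-such-command' 2>/dev/null; echo $?   # -> 127
--------------------------------------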
mesos slave INFO log says:

I0418 04:26:31.390650 8 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150418-041001-553718188-5050-1-0001
I0418 04:26:31.392432 8 slave.cpp:4160] Launching executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 in work directory '/tmp/mesos/slaves/20150418-041001-553718188-5050-1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:31.392587 8 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429330829 of framework '20150418-041001-553718188-5050-1-0001
I0418 04:26:31.397415 7 docker.cpp:755] Starting container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' for executor 'insights-1-1429330829' and framework '20150418-041001-553718188-5050-1-0001'
I0418 04:26:31.397835 7 fetcher.cpp:238] Fetching URIs using command '/usr/libexec/mesos/mesos-fetcher'
I0418 04:26:32.177479 11 docker.cpp:1333] Executor for container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' has exited
I0418 04:26:32.177817 11 docker.cpp:1159] Destroying container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:32.177999 11 docker.cpp:1248] Running docker stop on container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:32.177620 6 slave.cpp:3135] Monitoring executor 'insights-1-1429330829' of framework '20150418-041001-553718188-5050-1-0001' in container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:32.477990 12 slave.cpp:3186] Executor 'insights-1-1429330829' of framework 20150418-041001-553718188-5050-1-0001 has terminated with unknown status
I0418 04:26:32.479394 12 slave.cpp:2508] Handling status update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001 from @0.0.0.0:0
W0418 04:26:32.479645 12 docker.cpp:841] Ignoring updating unknown container: 3cc411b0-c2e0-41ae-80c2-f0306371da5a
I0418 04:26:32.480041 10 status_update_manager.cpp:317] Received status update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001
I0418 04:26:32.481073 12 slave.cpp:2753] Forwarding the update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5050-1-0001 to master@172.17.1.33:5050

docker.log says:

time="2015-04-18T04:26:31Z" level=debug msg="Calling POST /containers/create"
time="2015-04-18T04:26:31Z" level=info msg="POST /v1.18/containers/create?name=mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a"
time="2015-04-18T04:26:31Z" level=info msg="+job create(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:31Z" level=info msg="+job log(create, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
time="2015-04-18T04:26:31Z" level=info msg="-job log(create, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
time="2015-04-18T04:26:31Z" level=info msg="-job create(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
time="2015-04-18T04:26:31Z" level=debug msg="Calling POST /containers/{name:.*}/start"
time="2015-04-18T04:26:31Z" level=info msg="POST /v1.18/containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/start"
time="2015-04-18T04:26:31Z" level=info msg="+job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)"
time="2015-04-18T04:26:31Z" level=info msg="+job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
time="2015-04-18T04:26:31Z" level=info msg="-job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
time="2015-04-18T04:26:31Z" level=debug msg="Calling GET /containers/{name:.*}/json"
time="2015-04-18T04:26:31Z" level=info msg="GET /containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/json"
time="2015-04-18T04:26:31Z" level=info msg="+job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)"
time="2015-04-18T04:26:32Z" level=info msg="-job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)"
time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)"
time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/json"
time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json"
time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/json"
time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json"
time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
time="2015-04-18T04:26:32Z" level=debug msg="Calling POST /containers/{name:.*}/wait"
time="2015-04-18T04:26:32Z" level=info msg="POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/wait"
time="2015-04-18T04:26:32Z" level=info msg="+job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:32Z" level=debug msg="Calling GET /containers/{name:.*}/logs"
time="2015-04-18T04:26:32Z" level=info msg="GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/logs?follow=1&stderr=1&stdout=1&tail=all"
time="2015-04-18T04:26:32Z" level=info msg="+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:32Z" level=info msg="-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
time="2015-04-18T04:26:32Z" level=info msg="+job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:32Z" level=debug msg="Calling POST /containers/{name:.*}/stop"
time="2015-04-18T04:26:32Z" level=info msg="POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/stop?t=0"
time="2015-04-18T04:26:32Z" level=info msg="+job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)"
time="2015-04-18T04:26:32Z" level=debug msg="Sending 15 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4"
time="2015-04-18T04:26:32Z" level=info msg="Container 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4 failed to exit within 0 seconds of SIGTERM - using the force"
time="2015-04-18T04:26:32Z" level=debug msg="Sending 9 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4"
time="2015-04-18T04:26:32Z" level=info msg="+job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
time="2015-04-18T04:26:32Z" level=info msg="-job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
time="2015-04-18T04:26:32Z" level=info msg="-job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
time="2015-04-18T04:26:32Z" level=info msg="-job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"
time="2015-04-18T04:26:32Z" level=info msg="+job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)"
time="2015-04-18T04:26:32Z" level=info msg="-job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)"
time="2015-04-18T04:26:32Z" level=info msg="-job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)"

I don't see a syslog for the master/slave containers.

Thanks
Tyson

On Apr 17, 2015, at 7:07 PM, Jason Giedymin <jason.giedy...@gmail.com> wrote:

What do any/all logs say? (syslog)

-Jason

On Apr 17, 2015, at 7:22 PM, Tyson Norris <tnor...@adobe.com> wrote:

Another interesting fact: I can restart the docker container of my executor, and it runs great. In the test example below, notice that stdout keeps growing as expected after restarting the container. So something is killing my executor container (also indicated by the "Exited (137) About a minute ago"), but I'm still not sure what.

Thanks
Tyson

tnorris-osx:insights tnorris$ docker ps -a | grep testexec
5291fe29c9c2 testexecutor:latest "/bin/sh -c executor About a minute ago Exited (137) About a minute ago mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
tnorris-osx:insights tnorris$ docker start mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
tnorris-osx:insights tnorris$ docker logs mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
waiting for something to do something
tnorris-osx:insights tnorris$ docker stop mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9
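The "Exited (137)" above lines up with the docker.log earlier in the thread: the daemon ran stop with t=0, sending SIGTERM and then, with a zero-second grace period, SIGKILL (128 + 9 = 137). A sketch to confirm this from the Docker side - assuming this Docker version exposes the usual State fields via inspect; the container name is the one from the listing above:

--------------------------------------
C=mesos-f573677c-d0ee-4aa0-abba-40b7efc7cfe9   # executor container from above
docker inspect -f 'exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' $C
# exit=137 oom=false -> killed from outside (SIGKILL), not an OOM kill
# and not a crash of the script itself
--------------------------------------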
On Apr 17, 2015, at 2:11 PM, Tyson Norris <tnor...@adobe.com> wrote:

You can reproduce this with almost any Dockerfile, I think - it seems that launching a custom executor that is a docker container has some problem. I just made a simple test with this Dockerfile:

--------------------------------------
#this is oracle java8 atop phusion baseimage
FROM opentable/baseimage-java8:latest

#mesos lib (not used here, but will be in our "real" executor, e.g. to register the executor etc)
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E56151BF
RUN echo "deb http://repos.mesosphere.io/$(lsb_release -is | tr '[:upper:]' '[:lower:]') $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/mesosphere.list
RUN cat /etc/apt/sources.list.d/mesosphere.list
RUN apt-get update && apt-get install -y \
    mesos

ADD script.sh /usr/bin/executor-script.sh
CMD executor-script.sh
--------------------------------------

and script.sh:

--------------------------------------
#!/bin/bash
until false; do
    echo "waiting for something to do something"
    sleep 0.2
done
--------------------------------------

And in my stdout I get exactly 2 lines:

waiting for something to do something
waiting for something to do something

which is how many lines can be output within 0.5 seconds... something is fishy about the 0.5 seconds, but I'm not sure where. I'm not sure exactly what the difference is, but launching a docker container as a task WITHOUT a custom executor works fine, and I'm not sure about launching a docker container as a task that uses a non-docker custom executor. The case I'm trying for is a docker custom executor launching non-docker tasks (in case that helps clarify the situation).

Thanks
Tyson

On Apr 17, 2015, at 1:47 PM, Jason Giedymin <jason.giedy...@gmail.com> wrote:

Try:

until <something>; do
    echo "waiting for something to do something"
    sleep 5
done

You can put this in a bash file and run that. If you have a dockerfile it would be easier to debug.

-Jason

On Apr 17, 2015, at 4:24 PM, Tyson Norris <tnor...@adobe.com> wrote:

Yes, agreed that the command should not exit - but the container is killed around 0.5 s after launch regardless of whether the command terminates, which is why I've been experimenting with commands that have varied exit times. For example, forget about the executor needing to register for the moment.

Using the command: echo testing123c && sleep 0.1 && echo testing456c
-> I see the expected output in stdout, and the container is destroyed (as expected), because the container exits quickly and is then destroyed.

Using the command: echo testing123d && sleep 0.6 && echo testing456d
-> I do NOT see the expected output in stdout (I only get testing123d), because the container is destroyed prematurely after ~0.5 seconds.

Using the "real" storm command, I get no output in stdout, probably because no output is generated within 0.5 seconds of launch - it is a bit of a pig to start up, so for now I'm just executing other commands for testing purposes.

So I'm guessing this is a timeout issue, or else the container is being reaped inappropriately, or something else... looking through this code, I'm trying to figure out the steps taken during executor launch:
https://github.com/apache/mesos/blob/00318fc1b30fc0961c2dfa4d934c37866577d801/src/slave/containerizer/docker.cpp#L715

Thanks
Tyson

On Apr 17, 2015, at 12:53 PM, Jason Giedymin <jason.giedy...@gmail.com> wrote:

What is the last command you have docker doing? If that command exits then docker will begin to end the container.

-Jason
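For the test image above, the last command is "CMD executor-script.sh" and the until-loop never exits on its own, so an exiting command is not the cause here. One detail worth noting, though: that CMD is Docker's shell form - the docker ps listing above even shows the command as "/bin/sh -c executor..." - so PID 1 in the container is /bin/sh, not the script, and the SIGTERM sent by docker stop (the "Sending 15 ... using the force" in docker.log) never reaches the loop. That doesn't explain the ~0.5 second kill, but it does make the death silent. A sketch of a variant that at least tries to log the stop signal (assumes the exec-form CMD below so the trap can actually fire; with the t=0 grace period seen here the immediate SIGKILL may still win the race, but with docker stop's default grace period this logs the signal):

--------------------------------------
#!/bin/bash
# script.sh variant: record the stop signal before dying
trap 'echo "received SIGTERM at $(date)"; exit 143' TERM

until false; do
    echo "waiting for something to do something"
    sleep 0.2
done
--------------------------------------

and in the Dockerfile, exec form instead of shell form:

--------------------------------------
CMD ["/usr/bin/executor-script.sh"]
--------------------------------------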
On Apr 17, 2015, at 3:23 PM, Tyson Norris <tnor...@adobe.com> wrote:

Hi -
I am looking at revving the mesos-storm framework to be dockerized (and simpler). I'm using mesos 0.22.0-1.0.ubuntu1404; the mesos master + mesos slave are deployed in docker containers, in case it matters.

I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor. For example:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("mesos-storm")))
        .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor"))
    //rest is unchanged from existing mesos-storm framework code

The executor launches and exits quickly - see the log msg: Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited

It seems like mesos loses track of the executor? I understand there is a 1 min timeout on registering the executor, but the exit happens well before 1 minute. I tried a few alternate commands to experiment, and I can see in the stdout for the task that "echo testing123 && echo testing456" prints both testing123 and testing456 to stdout correctly; however, "echo testing123a && sleep 10 && echo testing456a" prints only testing123a, presumably because the container is lost and destroyed before the sleep time is up.

So it's like the container for the executor is only allowed to run for 0.5 seconds, then it is detected as exited, and the task is lost.

Thanks for any advice.
Tyson

slave logs look like:

mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor insights-1-1429297638 of framework 20150417-190611-2801799596-5050-1-0000 in work directory '/tmp/mesos/slaves/20150417-190611-2801799596-5050-1-S0/frameworks/20150417-190611-2801799596-5050-1-0000/executors/insights-1-1429297638/runs/6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.463444 11 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429297638 of framework '20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.467200 7 docker.cpp:755] Starting container '6539127f-9dbb-425b-86a8-845b748f0cd3' for executor 'insights-1-1429297638' and framework '20150417-190611-2801799596-5050-1-0000'
mesosslave_1 | I0417 19:07:27.985935 7 docker.cpp:1333] Executor for container '6539127f-9dbb-425b-86a8-845b748f0cd3' has exited
mesosslave_1 | I0417 19:07:27.986359 7 docker.cpp:1159] Destroying container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.986021 9 slave.cpp:3135] Monitoring executor 'insights-1-1429297638' of framework '20150417-190611-2801799596-5050-1-0000' in container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.986464 7 docker.cpp:1248] Running docker stop on container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:28.286761 10 slave.cpp:3186] Executor 'insights-1-1429297638' of framework 20150417-190611-2801799596-5050-1-0000 has terminated with unknown status
mesosslave_1 | I0417 19:07:28.288784 10 slave.cpp:2508] Handling status update TASK_LOST (UUID: 0795a58b-f487-42e2-aaa1-a26fe6834ed7) for task mesos-slave1.service.consul-31000 of framework 20150417-190611-2801799596-5050-1-0000 from @0.0.0.0:0
mesosslave_1 | W0417 19:07:28.289227 9 docker.cpp:841] Ignoring updating unknown container: 6539127f-9dbb-425b-86a8-845b748f0cd3

nimbus logs (framework) look like:

2015-04-17T19:07:28.302+0000 s.m.MesosNimbus [INFO] Received status update: task_id { value: "mesos-slave1.service.consul-31000" } state: TASK_LOST message: "Container terminated" slave_id { value: "20150417-190611-2801799596-5050-1-S0" } timestamp: 1.429297648286981E9 source: SOURCE_SLAVE reason: REASON_EXECUTOR_TERMINATED 11: "\a\225\245\213\364\207B\342\252\241\242o\346\203N\327"
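The nimbus update matches the slave's view: REASON_EXECUTOR_TERMINATED, sent before the executor ever registered. The slave's --executor_registration_timeout (1 minute by default) is not what's firing here, since destruction happens ~0.5 s after launch, but it's worth confirming what the slave is actually configured with. A sketch, assuming the slave's HTTP endpoint is reachable at the host/port shown in the logs:

--------------------------------------
# The slave exposes its flags (including executor_registration_timeout
# and containerizers) in state.json
curl -s http://mesos-slave1.service.consul:5051/state.json \
  | python -m json.tool \
  | grep -E 'executor_registration_timeout|containerizers|docker'
--------------------------------------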