Re: mesos slave in docker container
ugh. Thanks! I knew this was an issue, and completely ignored the fact that someone changed the name… Thanks - works fine now. Tyson

On Jun 19, 2015, at 3:39 PM, Brian Devins badev...@gmail.com wrote:

You can't name the container mesos-slave. The slave currently sees all containers prefixed with 'mesos-' as ones it is supposed to administer, so it is killing itself off since the container doesn't match a task that should be running.

On Fri, Jun 19, 2015 at 6:35 PM, Tyson Norris tnor...@adobe.com wrote:

Hi - Sorry for the delay, just getting back to this. Below is the command and the stdout I get. I tried specifying just the mesos containerizer, as this person mentioned https://github.com/mesosphere/coreos-setup/issues/5 and had similar results - it works fine with the mesos containerizer, but not docker. Also, I am only seeing this fail on RHEL 7 with the docker containerizer; it works fine on ubuntu 14 with the docker containerizer. Thanks Tyson

[root@phx-8 ~]# docker run --rm -it \
  --name mesos-slave \
  --net host \
  --pid host \
  --privileged \
  --env MESOS_CONTAINERIZERS=docker \
  --env MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
  --env MESOS_HOSTNAME=192.168.8.8 \
  --env MESOS_IP=192.168.8.8 \
  --env MESOS_LOG_DIR=/var/log/mesos \
  --env MESOS_LOGGING_LEVEL=INFO \
  --env MESOS_MASTER=zk://zk1.service.consul:2181,zk2.service.consul:2181,zk3.service.consul:2181/mesos \
  --env SERVICE_5051_NAME=mesos-slave \
  --env MESOS_DOCKER_MESOS_IMAGE=docker.corp.adobe.com/tnorris/mesosslave:0.22.1-1.0.ubuntu1404 \
  --env GLOG_v=1 \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --volume /sys:/sys:ro \
  -p 0.0.0.0:5051:5051 \
  --entrypoint mesos-slave \
  docker.corp.adobe.com/tnorris/mesosslave:0.22.1-1.0.ubuntu1404

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0619 22:32:37.081535 8362 process.cpp:961] libprocess is initialized on 192.168.8.8:5051 for 8 cpus
I0619 22:32:37.160727 8362 logging.cpp:172] INFO level logging started!
I0619 22:32:37.161571 8362 logging.cpp:177] Logging to /var/log/mesos
I0619 22:32:37.161609 8362 main.cpp:156] Build: 2015-05-05 06:15:50 by root
I0619 22:32:37.161629 8362 main.cpp:158] Version: 0.22.1
I0619 22:32:37.161639 8362 main.cpp:161] Git tag: 0.22.1
I0619 22:32:37.161650 8362 main.cpp:165] Git SHA: d6309f92a7f9af3ab61a878403e3d9c284ea87e0
2015-06-19 22:32:40,196:8362(0x7f38e66b0700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-06-19 22:32:40,196:8362(0x7f38e66b0700):ZOO_INFO@log_env@716: Client environment:host.name=phx-8.corp.adobe.com
2015-06-19 22:32:40,196:8362(0x7f38e66b0700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-06-19 22:32:40,196:8362(0x7f38e66b0700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-123.el7.x86_64
2015-06-19 22:32:40,196:8362(0x7f38e66b0700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Mon May 5 11:16:57 EDT 2014
I0619 22:32:40.196564 8362 main.cpp:200] Starting Mesos slave
I0619 22:32:40.203459 8362 slave.cpp:174] Slave started on 1)@192.168.8.8:5051
I0619 22:32:40.205621 8362 slave.cpp:322] Slave resources: cpus(*):4; mem(*):14864; disk(*):4975; ports(*):[31000-32000]
I0619 22:32:40.206074 8362 slave.cpp:351] Slave hostname: 192.168.8.8
I0619 22:32:40.206116 8362 slave.cpp:352] Slave checkpoint: true
2015-06-19 22:32:40,208:8362(0x7f38e66b0700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
2015-06-19 22:32:40,208:8362(0x7f38e66b0700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-06-19 22:32:40,208:8362(0x7f38e66b0700):ZOO_INFO@log_env@753: Client environment:user.dir=/
2015-06-19 22:32:40,208:8362(0x7f38e66b0700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=zk1.service.consul:2181,zk2.service.consul:2181,zk3.service.consul:2181 sessionTimeout=1 watcher=0x7f38ea110a60 sessionId=0 sessionPasswd=<null> context=0x7f38d4000ea0 flags=0
I0619 22:32:40.208984 8367 state.cpp:35] Recovering state from '/tmp/mesos/meta'
I0619 22:32:40.209102 8367 slave.cpp:600] Successfully attached file '/var/log/mesos/mesos-slave.INFO'
I0619 22:32:40.209174 8367 status_update_manager.cpp:197] Recovering status update manager
I0619 22:32:40.230962 8367 docker.cpp:423] Recovering Docker containers
I0619 22:32:40.231061 8367 docker.cpp:697] Running docker ps -a
2015-06-19 22:32:40,252:8362(0x7f38e1a4b700):ZOO_INFO@check_events@1703: initiated connection to server [192.168.8.3:2181]
2015-06-19 22:32:40,256:8362
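The fix Brian describes is just a rename: any container name that does not begin with "mesos-" avoids the slave's cleanup of what it thinks are orphaned task containers. A minimal sketch of the corrected invocation, trimmed to the flags relevant here (the name "slave0" is illustrative, not from the thread; all other values match the command above):

--
# identical to the failing command except --name, which must not start
# with the "mesos-" prefix that the slave reserves for its own containers
docker run --rm -it \
  --name slave0 \
  --net host \
  --pid host \
  --privileged \
  --env MESOS_CONTAINERIZERS=docker \
  --env MESOS_MASTER=zk://zk1.service.consul:2181,zk2.service.consul:2181,zk3.service.consul:2181/mesos \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --volume /sys:/sys:ro \
  --entrypoint mesos-slave \
  docker.corp.adobe.com/tnorris/mesosslave:0.22.1-1.0.ubuntu1404
--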
mesos slave in docker container
Hi - We are running mesos slave (0.22.0-1.0.ubuntu1404) in a docker container with the docker containerizer without problems on an ubuntu 14.04 docker host (with the lxc-docker pkg etc. added). Running the same slave container on a RHEL 7.0 docker host, the container exits almost immediately after starting with:

I0613 07:18:15.161931 5303 slave.cpp:3808] Finished recovery
I0613 07:18:15.162677 5303 slave.cpp:647] New master detected at master@192.168.8.5:5050
I0613 07:18:15.162753 5301 status_update_manager.cpp:171] Pausing sending status updates
I0613 07:18:15.163051 5303 slave.cpp:672] No credentials provided. Attempting to register without authentication
I0613 07:18:15.163734 5303 slave.cpp:683] Detecting new master
W0613 07:18:15.163734 5293 logging.cpp:81] RAW: Received signal SIGTERM from process 1166 of user 0; exiting

If I do not enable the docker containerizer, the slave container runs fine. Other containers that bind mount /var/run/docker.sock also run fine. Debug docker logs are below. One difference between the ubuntu docker host and the RHEL docker host is that the ubuntu host uses the aufs storage driver while rhel uses devicemapper, and selinux is enabled on RHEL but not on ubuntu. Thanks for any advice! Tyson

Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=POST /v1.18/containers/9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd/start
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=+job start(9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=activateDeviceIfNeeded(9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): ioctl/libdm-iface.c:1750 (4) dm info docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd OF [16384] (*1)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): ioctl/libdm-iface.c:1750 (4) dm create docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd OF [16384] (*1)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): libdm-common.c:1348 (4) docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd: Stacking NODE_ADD (253,9) 0:0 0600 [verify_udev]
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): ioctl/libdm-iface.c:1750 (4) dm reload docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd OF [16384] (*1)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): ioctl/libdm-iface.c:1750 (4) dm resume docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd OF [16384] (*1)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): libdm-common.c:1348 (4) docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd: Processing NODE_ADD (253,9) 0:0 0600 [verify_udev]
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=libdevmapper(6): libdm-common.c:983 (4) Created /dev/mapper/docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd
Jun 13 07:28:26 phx-8 kernel: EXT4-fs (dm-9): mounted filesystem with ordered data mode. Opts: discard
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=+job log(start, 9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd, docker.corp.adobe.com/tnorris/mesosslave:0.22.1-1.0.ubuntu1404)
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=-job log(start, 9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd, docker.corp.adobe.com/tnorris/mesosslave:0.22.1-1.0.ubuntu1404) = OK (0)
Jun 13 07:28:26 phx-8 systemd-udevd: conflicting device node '/dev/mapper/docker-253:3-16818501-9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd' found, link to '/dev/dm-9' will not be created
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=debug msg=Calling GET /containers/{name:.*}/json
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=GET /containers/9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd/json
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=+job container_inspect(9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd)
Jun 13 07:28:26 phx-8 systemd: Starting docker container 9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd.
Jun 13 07:28:26 phx-8 systemd: Started docker container 9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd.
Jun 13 07:28:26 phx-8 docker: time=2015-06-13T07:28:26Z level=info msg=-job start(9e897d0fd156dab5ec59f8ded2a6cdf7dc5379664c872cd7da4875b6aab9dfcd) = OK
Re: docker based executor
Ah, after reading some info at https://tnachen.wordpress.com/2014/08/19/docker-in-mesos-0-20/ I see that I should probably be setting my slave container to run with --net=host - with that it is working now. Are the changes for https://issues.apache.org/jira/browse/MESOS-2183 going to allow slave+executor to run with --net=bridge? Thanks! Tyson

On Apr 18, 2015, at 10:43 PM, Tim Chen t...@mesosphere.io wrote:

Hi Tyson, Glad you figured it out; sorry, I didn't realize you were running the mesos slave in a docker container (which surely complicates things). I have a series of patches pending merge that will also make recovering tasks work when relaunching mesos-slave in a docker container. Currently, even with --pid=host, when your slave dies your tasks are not able to recover when it restarts. Tim

On Sat, Apr 18, 2015 at 10:32 PM, Tyson Norris tnor...@adobe.com wrote:

Yes, this was the problem - sorry for the noise. For the record, running mesos-slave in a container requires the "--pid=host" option as mentioned in MESOS-2183. Now if docker-compose would just get released with support for setting the pid flag, life would be easy... Thanks Tyson

On Apr 18, 2015, at 9:48 PM, Tyson Norris tnor...@adobe.com wrote:

I think I may be running into this: https://issues.apache.org/jira/browse/MESOS-2183 I'm trying to get docker-compose to launch the slave with --pid=host, but am having a few separate problems with that. I will update this thread when I'm able to test that. Thanks Tyson

On Apr 18, 2015, at 1:14 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - Actually, rereading your email: "For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead." it sounds like you are suggesting running the container as a task command? But part of what I'm doing is trying to provide a custom executor, so I think what I had before is appropriate - eventually I want to make the tasks launch the same way (similar to the existing mesos-storm framework), but I am trying to launch the executor as a container instead of a script command, which I think should be possible. So maybe you can comment on using a container within an ExecutorInfo as below? Docs here: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L267 suggest that ContainerInfo and CommandInfo should be provided - I am using setShell(false) to avoid changing the entry point, which already uses the default "/bin/sh -c". Thanks Tyson

On Apr 18, 2015, at 1:03 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - I am using my own framework - a modified version of mesos-storm, attempting to use docker containers instead. The TaskInfo is like:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setCommand(CommandInfo.newBuilder()
            .setShell(false))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("testexecutor"))))

I understand this test image will be expected to fail - I expect it to fail by registration timeout, and not by simply dying, though. I'm only using a test image because I see the same behavior with my actual image that properly handles the mesos - executor registration protocol. I will try moving the Container inside the Command, and see if it survives longer. I see now at https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675 it mentions "Either ExecutorInfo or CommandInfo should be set" Thanks Tyson

On Apr 18, 2015, at 12:38 PM, Tim Chen t...@mesosphere.io wrote:

That does seem odd; how did you run this via mesos? Are you using your own framework, or going through another framework like Marathon? And what does the TaskInfo look like? Also note that if you're just testing a container, you don't want to set the ExecutorInfo with a command, as Executors in Mesos are expected to communicate back to the Mesos slave and implement the protocol between mesos and executor. For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead
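Pulling the resolution of this thread into one place: the slave container needs both host networking (per the blog post above) and the host pid namespace (per MESOS-2183). A minimal sketch of such an invocation; the image name "mesosslave-img" and the single-server zk URL are illustrative, everything else reuses flags that appear elsewhere in this thread:

--
# --net=host so the containerized slave works with the docker containerizer,
# --pid=host per MESOS-2183 so the slave can monitor/signal its executors
docker run -d \
  --net host \
  --pid host \
  --privileged \
  --env MESOS_CONTAINERIZERS=docker,mesos \
  --env MESOS_MASTER=zk://zk1.service.consul:2181/mesos \
  --volume /var/run/docker.sock:/var/run/docker.sock \
  --entrypoint mesos-slave \
  mesosslave-img
--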
Re: docker based executor
Hi Tim - Yes, I mentioned below when using a script like:

--
#!/bin/bash
until false; do
  echo waiting for something to do something
  sleep 0.2
done
--

In my sandbox stdout I get exactly 2 lines:

waiting for something to do something
waiting for something to do something

Running this container any other way, e.g. docker run --rm -it testexecutor, the output is an endless stream of "waiting for something to do something". So something is stopping the container, as opposed to the container just exiting; at least that's how it looks - I only get the container to stop when it is launched as an executor. Also, based on the docker logs, something is calling the /container/id/stop endpoint *before* the /container/id/logs endpoint - so the stop is arriving before the logs are tailed, which also seems incorrect, and suggests that there is some code explicitly stopping the container, instead of the container exiting itself. Thanks Tyson

On Apr 18, 2015, at 3:33 AM, Tim Chen t...@mesosphere.io wrote:

Hi Tyson, The error message you saw in the logs about the executor exiting actually just means the executor process has exited. Since you're launching a custom executor with MesosSupervisor, it seems like MesosSupervisor simply exited without reporting any task status. Can you look at the actual logs of the container? They can be found in the sandbox stdout and stderr logs. Tim

On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris tnor...@adobe.com wrote:

The sequence I see in the docker.log when my executor is launched is something like:

GET /containers/id/json
POST /containers/id/wait
POST /containers/id/stop
GET /containers/id/logs

So I'm wondering if the slave is calling docker stop out of order in slave/containerizer/docker.cpp. I only see it being called in recover and destroy, and I don't see logs indicating either of those happening, but I may be missing something else. Tyson

On Apr 17, 2015, at 9:42 PM, Tyson Norris tnor...@adobe.com wrote:

mesos master INFO log says:

I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to framework 20150411-165219-771756460-5050-1-0000 (marathon) at scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for offers: [ 20150418-041001-553718188-5050-1-O165 ] on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) for framework 20150411-165219-771756460-5050-1-0000 (marathon) at scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
I0418 04:26:31.580369 9 hierarchical.hpp:648] Recovered cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000] (total allocatable: cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000]) on slave 20150418-041001-553718188-5050-1-S0 from framework 20150411-165219-771756460-5050-1-0000
I0418 04:26:32.480036 12 master.cpp:3388] Executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 on slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051 (mesos-slave1.service.consul) terminated with signal Unknown signal 127

mesos slave INFO log says:

I0418 04:26:31.390650 8 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150418-041001-553718188-5050-1-0001
I0418 04:26:31.392432 8 slave.cpp:4160] Launching executor insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 in work directory '/tmp/mesos/slaves/20150418-041001-553718188-5050-1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:31.392587 8 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429330829 of framework '20150418-041001-553718188-5050-1-0001'
I0418 04:26:31.397415 7 docker.cpp:755] Starting container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' for executor 'insights-1-1429330829' and framework '20150418-041001-553718188-5050-1-0001'
I0418 04:26:31.397835 7 fetcher.cpp:238] Fetching URIs using command '/usr/libexec/mesos/mesos-fetcher'
I0418 04:26:32.177479 11 docker.cpp:1333] Executor for container '3cc411b0-c2e0-41ae-80c2-f0306371da5a' has exited
I0418 04:26:32.177817 11 docker.cpp:1159] Destroying container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:32.177999 11 docker.cpp:1248] Running docker stop on container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
I0418 04:26:32.177620 6 slave.cpp:3135] Monitoring executor 'insights-1-1429330829' of framework '20150418-041001-553718188-5050-1-0001' in container '3cc411b0-c2e0
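To act on Tim's suggestion of checking the sandbox stdout and stderr, the files live under the work directory printed in the "Launching executor ... in work directory" slave log line. A sketch; the path below is copied from the log above, so it applies only to that particular run:

--
# executor sandbox taken from the slave log line above;
# stdout/stderr here capture the executor container's output
WORKDIR=/tmp/mesos/slaves/20150418-041001-553718188-5050-1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a
cat "$WORKDIR/stdout"
cat "$WORKDIR/stderr"
--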
Re: docker based executor
Hi Tim - I am using my own framework - a modified version of mesos-storm, attempting to use docker containers instead. The TaskInfo is like:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setCommand(CommandInfo.newBuilder()
            .setShell(false))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("testexecutor"))))

I understand this test image will be expected to fail - I expect it to fail by registration timeout, and not by simply dying, though. I'm only using a test image because I see the same behavior with my actual image that properly handles the mesos - executor registration protocol. I will try moving the Container inside the Command, and see if it survives longer. I see now at https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675 it mentions "Either ExecutorInfo or CommandInfo should be set" Thanks Tyson

On Apr 18, 2015, at 12:38 PM, Tim Chen t...@mesosphere.io wrote:

That does seem odd; how did you run this via mesos? Are you using your own framework, or going through another framework like Marathon? And what does the TaskInfo look like? Also note that if you're just testing a container, you don't want to set the ExecutorInfo with a command, as Executors in Mesos are expected to communicate back to the Mesos slave and implement the protocol between mesos and executor. For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead. Tim

On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - Yes, I mentioned below when using a script like:

--
#!/bin/bash
until false; do
  echo waiting for something to do something
  sleep 0.2
done
--

In my sandbox stdout I get exactly 2 lines:

waiting for something to do something
waiting for something to do something

Running this container any other way, e.g. docker run --rm -it testexecutor, the output is an endless stream of "waiting for something to do something". So something is stopping the container, as opposed to the container just exiting; at least that's how it looks - I only get the container to stop when it is launched as an executor. Also, based on the docker logs, something is calling the /container/id/stop endpoint *before* the /container/id/logs endpoint - so the stop is arriving before the logs are tailed, which also seems incorrect, and suggests that there is some code explicitly stopping the container, instead of the container exiting itself. Thanks Tyson

On Apr 18, 2015, at 3:33 AM, Tim Chen t...@mesosphere.io wrote:

Hi Tyson, The error message you saw in the logs about the executor exiting actually just means the executor process has exited. Since you're launching a custom executor with MesosSupervisor, it seems like MesosSupervisor simply exited without reporting any task status. Can you look at the actual logs of the container? They can be found in the sandbox stdout and stderr logs. Tim

On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris tnor...@adobe.com wrote:

The sequence I see in the docker.log when my executor is launched is something like:

GET /containers/id/json
POST /containers/id/wait
POST /containers/id/stop
GET /containers/id/logs

So I'm wondering if the slave is calling docker stop out of order in slave/containerizer/docker.cpp. I only see it being called in recover and destroy, and I don't see logs indicating either of those happening, but I may be missing something else. Tyson

On Apr 17, 2015, at 9:42 PM, Tyson Norris tnor...@adobe.com wrote:

mesos master INFO log says:

I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to framework 20150411-165219-771756460-5050-1-0000 (marathon) at scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for offers: [ 20150418-041001-553718188-5050-1-O165 ] on slave 20150418-041001
Re: docker based executor
Hi Tim - Actually, rereading your email: "For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead." it sounds like you are suggesting running the container as a task command? But part of what I'm doing is trying to provide a custom executor, so I think what I had before is appropriate - eventually I want to make the tasks launch the same way (similar to the existing mesos-storm framework), but I am trying to launch the executor as a container instead of a script command, which I think should be possible. So maybe you can comment on using a container within an ExecutorInfo as below? Docs here: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L267 suggest that ContainerInfo and CommandInfo should be provided - I am using setShell(false) to avoid changing the entry point, which already uses the default "/bin/sh -c". Thanks Tyson

On Apr 18, 2015, at 1:03 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - I am using my own framework - a modified version of mesos-storm, attempting to use docker containers instead. The TaskInfo is like:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setCommand(CommandInfo.newBuilder()
            .setShell(false))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("testexecutor"))))

I understand this test image will be expected to fail - I expect it to fail by registration timeout, and not by simply dying, though. I'm only using a test image because I see the same behavior with my actual image that properly handles the mesos - executor registration protocol. I will try moving the Container inside the Command, and see if it survives longer. I see now at https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675 it mentions "Either ExecutorInfo or CommandInfo should be set" Thanks Tyson

On Apr 18, 2015, at 12:38 PM, Tim Chen t...@mesosphere.io wrote:

That does seem odd; how did you run this via mesos? Are you using your own framework, or going through another framework like Marathon? And what does the TaskInfo look like? Also note that if you're just testing a container, you don't want to set the ExecutorInfo with a command, as Executors in Mesos are expected to communicate back to the Mesos slave and implement the protocol between mesos and executor. For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead. Tim

On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - Yes, I mentioned below when using a script like:

--
#!/bin/bash
until false; do
  echo waiting for something to do something
  sleep 0.2
done
--

In my sandbox stdout I get exactly 2 lines:

waiting for something to do something
waiting for something to do something

Running this container any other way, e.g. docker run --rm -it testexecutor, the output is an endless stream of "waiting for something to do something". So something is stopping the container, as opposed to the container just exiting; at least that's how it looks - I only get the container to stop when it is launched as an executor. Also, based on the docker logs, something is calling the /container/id/stop endpoint *before* the /container/id/logs endpoint - so the stop is arriving before the logs are tailed, which also seems incorrect, and suggests that there is some code explicitly stopping the container, instead of the container exiting itself. Thanks Tyson

On Apr 18, 2015, at 3:33 AM, Tim Chen t...@mesosphere.io wrote:

Hi Tyson, The error message you saw in the logs about the executor exiting actually just means the executor process has exited. Since you're launching a custom executor with MesosSupervisor, it seems like MesosSupervisor simply exited without reporting any task status. Can you look at the actual logs of the container? They can be found in the sandbox stdout and stderr logs. Tim

On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris
Re: docker based executor
I think I may be running into this: https://issues.apache.org/jira/browse/MESOS-2183 I'm trying to get docker-compose to launch the slave with --pid=host, but am having a few separate problems with that. I will update this thread when I'm able to test that. Thanks Tyson

On Apr 18, 2015, at 1:14 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - Actually, rereading your email: "For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead." it sounds like you are suggesting running the container as a task command? But part of what I'm doing is trying to provide a custom executor, so I think what I had before is appropriate - eventually I want to make the tasks launch the same way (similar to the existing mesos-storm framework), but I am trying to launch the executor as a container instead of a script command, which I think should be possible. So maybe you can comment on using a container within an ExecutorInfo as below? Docs here: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L267 suggest that ContainerInfo and CommandInfo should be provided - I am using setShell(false) to avoid changing the entry point, which already uses the default "/bin/sh -c". Thanks Tyson

On Apr 18, 2015, at 1:03 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - I am using my own framework - a modified version of mesos-storm, attempting to use docker containers instead. The TaskInfo is like:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setCommand(CommandInfo.newBuilder()
            .setShell(false))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("testexecutor"))))

I understand this test image will be expected to fail - I expect it to fail by registration timeout, and not by simply dying, though. I'm only using a test image because I see the same behavior with my actual image that properly handles the mesos - executor registration protocol. I will try moving the Container inside the Command, and see if it survives longer. I see now at https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675 it mentions "Either ExecutorInfo or CommandInfo should be set" Thanks Tyson

On Apr 18, 2015, at 12:38 PM, Tim Chen t...@mesosphere.io wrote:

That does seem odd; how did you run this via mesos? Are you using your own framework, or going through another framework like Marathon? And what does the TaskInfo look like? Also note that if you're just testing a container, you don't want to set the ExecutorInfo with a command, as Executors in Mesos are expected to communicate back to the Mesos slave and implement the protocol between mesos and executor. For a test image like this you want to set the CommandInfo with a ContainerInfo holding the docker image instead. Tim

On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris tnor...@adobe.com wrote:

Hi Tim - Yes, I mentioned below when using a script like:

--
#!/bin/bash
until false; do
  echo waiting for something to do something
  sleep 0.2
done
--

In my sandbox stdout I get exactly 2 lines:

waiting for something to do something
waiting for something to do something

Running this container any other way, e.g. docker run --rm -it testexecutor, the output is an endless stream of "waiting for something to do something". So something is stopping the container, as opposed to the container just exiting; at least that's how it looks - I only get the container to stop when it is launched as an executor. Also, based on the docker logs, something is calling the /container/id/stop endpoint *before* the /container/id/logs endpoint - so the stop is arriving before the logs are tailed, which also seems incorrect, and suggests that there is some code explicitly stopping the container, instead of the container exiting itself. Thanks Tyson

On Apr 18, 2015, at 3:33 AM, Tim Chen t...@mesosphere.io wrote:

Hi Tyson, The error message you saw in the logs about the executor exited
docker based executor
Hi - I am looking at revving the mesos-storm framework to be dockerized (and simpler). I'm using mesos 0.22.0-1.0.ubuntu1404; mesos master + mesos slave are deployed in docker containers, in case it matters. I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor. For example:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("mesos-storm")))
        .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor")))
// rest is unchanged from existing mesos-storm framework code

The executor launches and exits quickly - see the log msg: Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited. It seems like mesos loses track of the executor? I understand there is a 1 min timeout on registering the executor, but the exit happens well before 1 minute. I tried a few alternate commands to experiment, and I can see in the stdout for the task that "echo testing123 && echo testing456" prints to stdout correctly (both testing123 and testing456); however, "echo testing123a && sleep 10 && echo testing456a" prints only testing123a, presumably because the container is lost and destroyed before the sleep time is up. So it's like the container for the executor is only allowed to run for .5 seconds, then it is detected as exited, and the task is lost. Thanks for any advice. Tyson

slave logs look like:

mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor insights-1-1429297638 of framework 20150417-190611-2801799596-5050-1-0000 in work directory '/tmp/mesos/slaves/20150417-190611-2801799596-5050-1-S0/frameworks/20150417-190611-2801799596-5050-1-0000/executors/insights-1-1429297638/runs/6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.463444 11 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429297638 of framework '20150417-190611-2801799596-5050-1-0000'
mesosslave_1 | I0417 19:07:27.467200 7 docker.cpp:755] Starting container '6539127f-9dbb-425b-86a8-845b748f0cd3' for executor 'insights-1-1429297638' and framework '20150417-190611-2801799596-5050-1-0000'
mesosslave_1 | I0417 19:07:27.985935 7 docker.cpp:1333] Executor for container '6539127f-9dbb-425b-86a8-845b748f0cd3' has exited
mesosslave_1 | I0417 19:07:27.986359 7 docker.cpp:1159] Destroying container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.986021 9 slave.cpp:3135] Monitoring executor 'insights-1-1429297638' of framework '20150417-190611-2801799596-5050-1-0000' in container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.986464 7 docker.cpp:1248] Running docker stop on container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:28.286761 10 slave.cpp:3186] Executor 'insights-1-1429297638' of framework 20150417-190611-2801799596-5050-1-0000 has terminated with unknown status
mesosslave_1 | I0417 19:07:28.288784 10 slave.cpp:2508] Handling status update TASK_LOST (UUID: 0795a58b-f487-42e2-aaa1-a26fe6834ed7) for task mesos-slave1.service.consul-31000 of framework 20150417-190611-2801799596-5050-1-0000 from @0.0.0.0:0
mesosslave_1 | W0417 19:07:28.289227 9 docker.cpp:841] Ignoring updating unknown container: 6539127f-9dbb-425b-86a8-845b748f0cd3

nimbus logs (framework) look like:

2015-04-17T19:07:28.302+0000 s.m.MesosNimbus [INFO] Received status update: task_id { value: mesos-slave1.service.consul-31000 } state: TASK_LOST message: Container terminated slave_id { value: 20150417-190611-2801799596-5050-1-S0 } timestamp: 1.429297648286981E9 source: SOURCE_SLAVE reason: REASON_EXECUTOR_TERMINATED 11: \a\225\245\213\364\207B\342\252\241\242o\346\203N\327
Re: docker based executor
Yes, agreed that the command should not exit - but the container is killed at around 0.5 s after launch regardless of whether the command terminates, which is why I've been experimenting with commands that have varied exit times. For example, forget for the moment about the executor needing to register.

Using the command: "echo testing123c && sleep 0.1 && echo testing456c" - I see the expected output in stdout, and the container is destroyed (as expected), because the container exits quickly and is then destroyed.

Using the command: "echo testing123d && sleep 0.6 && echo testing456d" - I do NOT see the expected output in stdout (I only get testing123d), because the container is destroyed prematurely after ~0.5 seconds.

Using the "real" storm command, I get no output in stdout, probably because no output is generated within 0.5 seconds of launch - it is a bit of a pig to start up, so I'm currently just trying to execute some other commands for testing purposes. So I'm guessing this is a timeout issue, or else the container is being reaped inappropriately, or something else… looking through this code, I'm trying to figure out the steps taken during executor launch: https://github.com/apache/mesos/blob/00318fc1b30fc0961c2dfa4d934c37866577d801/src/slave/containerizer/docker.cpp#L715 Thanks Tyson

On Apr 17, 2015, at 12:53 PM, Jason Giedymin jason.giedy...@gmail.com wrote:

What is the last command you have docker doing? If that command exits then docker will begin to end the container. -Jason

On Apr 17, 2015, at 3:23 PM, Tyson Norris tnor...@adobe.com wrote:

Hi - I am looking at revving the mesos-storm framework to be dockerized (and simpler). I'm using mesos 0.22.0-1.0.ubuntu1404; mesos master + mesos slave are deployed in docker containers, in case it matters. I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor. For example:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("mesos-storm")))
        .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor")))
// rest is unchanged from existing mesos-storm framework code

The executor launches and exits quickly - see the log msg: Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited. It seems like mesos loses track of the executor? I understand there is a 1 min timeout on registering the executor, but the exit happens well before 1 minute. I tried a few alternate commands to experiment, and I can see in the stdout for the task that "echo testing123 && echo testing456" prints to stdout correctly (both testing123 and testing456); however, "echo testing123a && sleep 10 && echo testing456a" prints only testing123a, presumably because the container is lost and destroyed before the sleep time is up. So it's like the container for the executor is only allowed to run for .5 seconds, then it is detected as exited, and the task is lost. Thanks for any advice. Tyson

slave logs look like:

mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor insights-1-1429297638 of framework 20150417-190611-2801799596-5050-1-0000 in work directory '/tmp/mesos/slaves/20150417-190611-2801799596-5050-1-S0/frameworks/20150417-190611-2801799596-5050-1-0000/executors/insights-1-1429297638/runs/6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.463444 11 slave.cpp:1378] Queuing task 'mesos-slave1.service.consul-31000' for executor insights-1-1429297638 of framework '20150417-190611-2801799596-5050-1-0000'
mesosslave_1 | I0417 19:07:27.467200 7 docker.cpp:755] Starting container '6539127f-9dbb-425b-86a8-845b748f0cd3' for executor 'insights-1-1429297638' and framework '20150417-190611-2801799596-5050-1-0000'
mesosslave_1 | I0417 19:07:27.985935 7 docker.cpp:1333] Executor for container '6539127f-9dbb-425b-86a8-845b748f0cd3' has exited
mesosslave_1 | I0417 19:07:27.986359 7 docker.cpp:1159] Destroying
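One way to separate docker's behavior from the containerizer's behavior here (not from the thread; the invocation is an illustrative sketch) is to run the same ~0.6 s probe directly against the daemon. If both lines print, docker itself is not killing the container and the early stop is coming from the executor launch path:

--
# outside mesos this should print testing123d and, 0.6 s later, testing456d
docker run --rm mesos-storm sh -c 'echo testing123d && sleep 0.6 && echo testing456d'
--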
Re: docker based executor
You can reproduce with most any dockerfile, I think - it seems like launching a custom executor that is a docker container has some problem. I just made a simple test with this dockerfile:

--
#this is oracle java8 atop phusion baseimage
FROM opentable/baseimage-java8:latest
#mesos lib (not used here, but will be in our "real" executor, e.g. to register the executor etc)
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E56151BF
RUN echo "deb http://repos.mesosphere.io/$(lsb_release -is | tr '[:upper:]' '[:lower:]') $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/mesosphere.list
RUN cat /etc/apt/sources.list.d/mesosphere.list
RUN apt-get update && apt-get install -y \
    mesos
ADD script.sh /usr/bin/executor-script.sh
CMD executor-script.sh
--

and script.sh:

--
#!/bin/bash
until false; do
  echo waiting for something to do something
  sleep 0.2
done
--

And in my stdout I get exactly 2 lines:

waiting for something to do something
waiting for something to do something

Which is how many lines can be output within 0.5 seconds… something is fishy about the 0.5 seconds, but I'm not sure where. I'm not sure exactly what the difference is, but launching a docker container as a task WITHOUT a custom executor works fine, and I'm not sure about launching a docker container as a task that is using a non-docker custom executor. The case I'm trying for is using a docker custom executor, and launching non-docker tasks (in case that helps clarify the situation). Thanks Tyson

On Apr 17, 2015, at 1:47 PM, Jason Giedymin jason.giedy...@gmail.com wrote:

Try:

until something; do
  echo waiting for something to do something
  sleep 5
done

You can put this in a bash file and run that. If you have a dockerfile it would be easier to debug. -Jason

On Apr 17, 2015, at 4:24 PM, Tyson Norris tnor...@adobe.com wrote:

Yes, agreed that the command should not exit - but the container is killed at around 0.5 s after launch regardless of whether the command terminates, which is why I've been experimenting with commands that have varied exit times. For example, forget for the moment about the executor needing to register.

Using the command: "echo testing123c && sleep 0.1 && echo testing456c" - I see the expected output in stdout, and the container is destroyed (as expected), because the container exits quickly and is then destroyed.

Using the command: "echo testing123d && sleep 0.6 && echo testing456d" - I do NOT see the expected output in stdout (I only get testing123d), because the container is destroyed prematurely after ~0.5 seconds.

Using the "real" storm command, I get no output in stdout, probably because no output is generated within 0.5 seconds of launch - it is a bit of a pig to start up, so I'm currently just trying to execute some other commands for testing purposes. So I'm guessing this is a timeout issue, or else the container is being reaped inappropriately, or something else… looking through this code, I'm trying to figure out the steps taken during executor launch: https://github.com/apache/mesos/blob/00318fc1b30fc0961c2dfa4d934c37866577d801/src/slave/containerizer/docker.cpp#L715 Thanks Tyson

On Apr 17, 2015, at 12:53 PM, Jason Giedymin jason.giedy...@gmail.com wrote:

What is the last command you have docker doing? If that command exits then docker will begin to end the container. -Jason

On Apr 17, 2015, at 3:23 PM, Tyson Norris tnor...@adobe.com wrote:

Hi - I am looking at revving the mesos-storm framework to be dockerized (and simpler). I'm using mesos 0.22.0-1.0.ubuntu1404; mesos master + mesos slave are deployed in docker containers, in case it matters. I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor. For example:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("mesos-storm")))
        .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor")))
// rest is unchanged from existing mesos-storm framework code

The executor launches and exits quickly - see the log msg: Executor for container
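For reference, the standalone check mentioned earlier ("running this container any other way") follows directly from the dockerfile above; a minimal sketch, assuming the dockerfile and script.sh sit in the current directory, using the thread's "testexecutor" tag:

--
# outside mesos the loop runs forever, printing a line every 0.2 s;
# launched as a mesos executor it was being stopped after ~0.5 s
docker build -t testexecutor .
docker run --rm -it testexecutor
--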
Re: docker based executor
, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)
time=2015-04-18T04:26:31Z level=info msg=-job log(start, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)
time=2015-04-18T04:26:31Z level=debug msg=Calling GET /containers/{name:.*}/json
time=2015-04-18T04:26:31Z level=info msg=GET /containers/4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4/json
time=2015-04-18T04:26:31Z level=info msg=+job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4)
time=2015-04-18T04:26:32Z level=info msg=-job start(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)
time=2015-04-18T04:26:32Z level=info msg=-job container_inspect(4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4) = OK (0)
time=2015-04-18T04:26:32Z level=debug msg=Calling GET /containers/{name:.*}/json
time=2015-04-18T04:26:32Z level=info msg=GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json
time=2015-04-18T04:26:32Z level=info msg=+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)
time=2015-04-18T04:26:32Z level=info msg=-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)
time=2015-04-18T04:26:32Z level=debug msg=Calling GET /containers/{name:.*}/json
time=2015-04-18T04:26:32Z level=info msg=GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/json
time=2015-04-18T04:26:32Z level=info msg=+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)
time=2015-04-18T04:26:32Z level=info msg=-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)
time=2015-04-18T04:26:32Z level=debug msg=Calling POST /containers/{name:.*}/wait
time=2015-04-18T04:26:32Z level=info msg=POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/wait
time=2015-04-18T04:26:32Z level=info msg=+job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)
time=2015-04-18T04:26:32Z level=debug msg=Calling GET /containers/{name:.*}/logs
time=2015-04-18T04:26:32Z level=info msg=GET /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/logs?follow=1&stderr=1&stdout=1&tail=all
time=2015-04-18T04:26:32Z level=info msg=+job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)
time=2015-04-18T04:26:32Z level=info msg=-job container_inspect(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)
time=2015-04-18T04:26:32Z level=info msg=+job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)
time=2015-04-18T04:26:32Z level=debug msg=Calling POST /containers/{name:.*}/stop
time=2015-04-18T04:26:32Z level=info msg=POST /v1.18/containers/mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a/stop?t=0
time=2015-04-18T04:26:32Z level=info msg=+job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a)
time=2015-04-18T04:26:32Z level=debug msg=Sending 15 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4
time=2015-04-18T04:26:32Z level=info msg=Container 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4 failed to exit within 0 seconds of SIGTERM - using the force
time=2015-04-18T04:26:32Z level=debug msg=Sending 9 to 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4
time=2015-04-18T04:26:32Z level=info msg=+job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)
time=2015-04-18T04:26:32Z level=info msg=-job log(die, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)
time=2015-04-18T04:26:32Z level=info msg=-job logs(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)
time=2015-04-18T04:26:32Z level=info msg=-job wait(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)
time=2015-04-18T04:26:32Z level=info msg=+job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest)
time=2015-04-18T04:26:32Z level=info msg=-job log(stop, 4e8320cb2a8e4ede5fb5ae386866addfe008c0035397fe44b84f401e959f96f4, testexecutor:latest) = OK (0)
time=2015-04-18T04:26:32Z level=info msg=-job stop(mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a) = OK (0)

I don't see a syslog for the master/slave containers. Thanks Tyson

On Apr 17, 2015, at 7:07 PM, Jason Giedymin jason.giedy...@gmail.com wrote:

What do any/all logs say? (syslog) -Jason

On Apr 17, 2015, at 7:22 PM, Tyson Norris tnor...@adobe.com wrote:

another interesting fact: I can restart the docker container of my executor, and it runs great. In the test example below, notice the stdout appears to be growing as expected after restarting the container. So something is killing my executor container (also indicated by the "Exited (137) About a minute ago"), but I'm still not sure what. Thanks Tyson

tnorris-osx:insights tnorris$ docker ps -a | grep testexec
5291fe29c9c2  testexecutor:latest  /bin/sh -c executor  About
Re: docker based executor
Hi Erik - Yes, these sound like good changes - I am currently focused on just trying to strip things down to be simpler for building versions etc. Specifically I've been working on:

- don't distribute config via an embedded http server, just send the settings via command args, e.g. -c mesos.master.url=zk://zk1.service.consul:2181/mesos -c storm.zookeeper.servers=[\"zk1.service.consul\"]
- use docker to ease framework+executor distribution (instead of repacking a storm tarball?)
- a single container that has the storm installation + an overlayed lib dir with mesos-storm.jar; run it just like the storm script: docker run mesos-storm supervisor storm.mesos.MesosSupervisor (use the same container for the supervisor executor + nimbus framework container)

Currently I'm stuck on this problem of the executor container dying without any indication why. I only know that it runs whatever container I specify for the executor for approx half a second, and then it dies. I have tried different containers, and different variants of shell true/false, etc. I haven't been able to find any examples of running a container as an executor, so while it seems like it would make things simpler, it's not that way yet. I will be happy to participate in refactoring, feel free to email me off-list. Thanks Tyson

On Apr 17, 2015, at 9:18 PM, Erik Weathers eweath...@groupon.com wrote:

hey Tyson, I've also worked a bit on improving/simplifying the mesos-storm framework -- spent the recent Mesosphere hackathon working with tnachen of Mesosphere on this. Nothing deliverable quite yet. We didn't look at dockerization at all; the hacking we did was around these goals:

* Avoiding the greedy hoarding of Offers done by the mesos-storm framework (ditching RotatingMap, and only hoarding Offers when there are topologies that need storm worker slots).
* Allowing the Mesos UI to distinguish the topologies, by having the Mesos tasks be dedicated to a topology.
* Adding usable logging in MesosNimbus. (Some of this work should be usable by other Mesos frameworks, since I'm pretty-printing the Mesos protobuf objects in 1-line JSON instead of the bazillion-line protobuf toString() pseudo-JSON output. Would be nice to create a library out of it.)

Would you like to participate in an offline thread on mesos-storm refactoring? Thanks! - Erik

On Fri, Apr 17, 2015 at 12:23 PM, Tyson Norris tnor...@adobe.com wrote:

Hi - I am looking at revving the mesos-storm framework to be dockerized (and simpler). I'm using mesos 0.22.0-1.0.ubuntu1404; mesos master + mesos slave are deployed in docker containers, in case it matters. I have the storm (nimbus) framework launching fine as a docker container, but launching tasks for a topology is having problems related to using a docker-based executor. For example:

TaskInfo task = TaskInfo.newBuilder()
    .setName("worker" + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("mesos-storm")))
        .setCommand(CommandInfo.newBuilder().setShell(true).setValue("storm supervisor storm.mesos.MesosSupervisor")))
// rest is unchanged from existing mesos-storm framework code

The executor launches and exits quickly - see the log msg: Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited. It seems like mesos loses track of the executor? I understand there is a 1 min timeout on registering the executor, but the exit happens well before 1 minute. I tried a few alternate commands to experiment, and I can see in the stdout for the task that "echo testing123 && echo testing456" prints to stdout correctly (both testing123 and testing456); however, "echo testing123a && sleep 10 && echo testing456a" prints only testing123a, presumably because the container is lost and destroyed before the sleep time is up. So it's like the container for the executor is only allowed to run for .5 seconds, then it is detected as exited, and the task is lost. Thanks for any advice. Tyson

slave logs look like:

mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task mesos-slave1.service.consul-31000 for framework 20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor insights-1-1429297638 of framework 20150417-190611