That worked, but first I had to destroy Marathon's state in ZooKeeper, since it just kept trying to deploy apps that I was trying to destroy, and no amount of cancelling deployments or scaling down apps helped. After that, I had to kill all of the Docker containers that were still being supervised by the mesos-slaves, and finally I had to reboot one mesos-slave that still believed one of the Docker tasks was running underneath it.
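Roughly, that cleanup amounts to the following. Treat it as a sketch rather than exact commands: the znode path assumes Marathon's default zk://.../marathon state location, the "mesos-" name prefix comes from the docker ps output quoted further down, and the zkCli.sh location and service names depend on how ZooKeeper, Marathon, and the slaves were installed:

  # Stop Marathon, wipe its state znode (default /marathon path assumed), start it again:
  $ sudo service marathon stop
  $ zkCli.sh -server localhost:2181 rmr /marathon
  $ sudo service marathon start

  # On each slave, remove any containers Mesos left behind (they carry the mesos- name prefix):
  $ sudo docker ps | grep mesos- | awk '{print $1}' | xargs -r sudo docker rm -f

  # Restart (or reboot) any slave that still thinks one of those tasks is running:
  $ sudo service mesos-slave restart
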
Once I did all that, I can now deploy a Docker app without the whole thing going crazy on me.

Thanks!

Eduardo

On Thu, Oct 23, 2014 at 1:50 AM, Nils De Moor <[email protected]> wrote:

> Sorry for that, here is the direct link:
>
> http://mail-archives.apache.org/mod_mbox/mesos-user/201410.mbox/%3CCALnA0DwbqJdEm3at7SonsXmCwxnN3%3DCUrwgoPBHXJDoOFyJjig%40mail.gmail.com%3E
>
> On Thu, Oct 23, 2014 at 10:49 AM, Nils De Moor <[email protected]> wrote:
>
>> I had the same issue as you did, here is how I fixed it:
>> http://mail-archives.apache.org/mod_mbox/mesos-user/201410.mbox/browser
>>
>> On Thu, Oct 23, 2014 at 1:42 AM, Connor Doyle <[email protected]> wrote:
>>
>>> Hi Eduardo,
>>>
>>> There is a known defect in Mesos that matches your description:
>>> https://issues.apache.org/jira/browse/MESOS-1915
>>> https://issues.apache.org/jira/browse/MESOS-1884
>>>
>>> A fix will be included in the next release.
>>> https://reviews.apache.org/r/26486
>>>
>>> You see the killTask because the default --task_launch_timeout value for Marathon is 60 seconds.
>>> Created an issue to make the logging around this better:
>>> https://github.com/mesosphere/marathon/issues/732
>>>
>>> --
>>> Connor
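Raising that timeout (for example to match the slave's --executor_registration_timeout=5mins) would look roughly like the line below. The bin/start launcher and the milliseconds unit are assumptions about Marathon 0.7, so check --help on your build:

  $ ./bin/start <existing marathon flags> --task_launch_timeout 300000   # 300000 ms = 5 minutes
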
>>>
>>> On Oct 22, 2014, at 16:18, Eduardo Jiménez <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > I've started experimenting with Mesos using the Docker containerizer, and running a simple example got into a very strange state.
>>> >
>>> > I have mesos-0.20.1 and marathon-0.7 set up on EC2, using Amazon Linux:
>>> >
>>> > Linux <ip> 3.14.20-20.44.amzn1.x86_64 #1 SMP Mon Oct 6 22:52:46 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>> >
>>> > Docker version 1.2.0, build fa7b24f/1.2.0
>>> >
>>> > I start the mesos-slave with these relevant options:
>>> >
>>> > --cgroups_hierarchy=/cgroup
>>> > --containerizers=docker,mesos
>>> > --executor_registration_timeout=5mins
>>> > --isolation=cgroups/cpu,cgroups/mem
>>> >
>>> > I launched a very simple app, which is from the Mesosphere examples:
>>> >
>>> > {
>>> >   "container": {
>>> >     "type": "DOCKER",
>>> >     "docker": {
>>> >       "image": "libmesos/ubuntu"
>>> >     }
>>> >   },
>>> >   "id": "ubuntu-docker2",
>>> >   "instances": "1",
>>> >   "cpus": "0.5",
>>> >   "mem": "512",
>>> >   "uris": [],
>>> >   "cmd": "while sleep 10; do date -u +%T; done"
>>> > }
>>> >
>>> > The app launches, but then Mesos states the task is KILLED, yet the Docker container is STILL running. Here's the sequence of logs from that mesos-slave.
>>> >
>>> > 1) Task gets created and assigned:
>>> >
>>> > I1022 17:44:13.971096 15195 slave.cpp:1002] Got assigned task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 for framework 20141017-172055-3489660938-5050-1603-0000
>>> > I1022 17:44:13.971367 15195 slave.cpp:1112] Launching task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 for framework 20141017-172055-3489660938-5050-1603-0000
>>> > I1022 17:44:13.973047 15195 slave.cpp:1222] Queuing task 'ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799' for executor ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework '20141017-172055-3489660938-5050-1603-0000
>>> > I1022 17:44:13.989893 15195 docker.cpp:743] Starting container 'c1fc27c8-13e9-484f-a30c-cb062ec4c978' for task 'ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799' (and executor 'ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799') of framework '20141017-172055-3489660938-5050-1603-0000'
>>> >
>>> > So far so good. The log statements immediately after "Starting container" are:
>>> >
>>> > I1022 17:45:14.893309 15196 slave.cpp:1278] Asked to kill task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework 20141017-172055-3489660938-5050-1603-0000
>>> > I1022 17:45:14.894579 15196 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 660dfd13-61a0-4e3f-9590-fba0d1a42ab2) for task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework 20141017-172055-3489660938-5050-1603-0000 from @0.0.0.0:0
>>> > W1022 17:45:14.894798 15196 slave.cpp:1354] Killing the unregistered executor 'ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799' of framework 20141017-172055-3489660938-5050-1603-0000 because it has no tasks
>>> > E1022 17:45:14.925014 15192 slave.cpp:2205] Failed to update resources for container c1fc27c8-13e9-484f-a30c-cb062ec4c978 of executor ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 running task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 on status update for terminal task, destroying container: No container found
>>> >
>>> > After this, there are several log messages like this:
>>> >
>>> > I1022 17:45:14.926197 15194 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: 660dfd13-61a0-4e3f-9590-fba0d1a42ab2) for task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework 20141017-172055-3489660938-5050-1603-0000
>>> > I1022 17:45:14.926378 15194 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 660dfd13-61a0-4e3f-9590-fba0d1a42ab2) for task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework 20141017-172055-3489660938-5050-1603-0000 to [email protected]:5050
>>> > W1022 17:45:16.169214 15196 status_update_manager.cpp:181] Resending status update TASK_KILLED (UUID: 660dfd13-61a0-4e3f-9590-fba0d1a42ab2) for task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework 20141017-172055-3489660938-5050-1603-0000
>>> > I1022 17:45:16.169275 15196 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 660dfd13-61a0-4e3f-9590-fba0d1a42ab2) for task ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799 of framework 20141017-172055-3489660938-5050-1603-0000 to [email protected]:5050
>>> >
>>> > Eventually the TASK_KILLED update is acked and the Mesos UI shows the task as killed. By then, the process should be dead, but it's not.
>>> >
>>> > $ sudo docker ps
>>> > CONTAINER ID   IMAGE                    COMMAND                CREATED       STATUS       PORTS   NAMES
>>> > f76784e1af8b   libmesos/ubuntu:latest   "/bin/sh -c 'while s   5 hours ago   Up 5 hours           mesos-c1fc27c8-13e9-484f-a30c-cb062ec4c978
>>> >
>>> > The container shows in the UI like this:
>>> >
>>> > ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799   ubuntu-docker2.0995fb7f-5a13-11e4-a18e-56847afe9799   KILLED   5 hours ago   5 hours ago
>>> >
>>> > And it's been running the whole time.
>>> >
>>> > There's no other logging indicating why killTask was invoked, which makes this extremely frustrating to debug.
>>> >
>>> > Has anyone seen something similar?
>>> >
>>> > Thanks,
>>> >
>>> > Eduardo
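
On the point in the reply at the top about a slave that still believed a Docker task was running: one way to compare what a slave thinks it is running against what Docker is actually running is to query the slave's state endpoint on the same box. A sketch, assuming the default slave port of 5051:

  $ curl -s http://localhost:5051/state.json | python -m json.tool | grep -B2 -A6 ubuntu-docker2
  $ sudo docker ps | grep mesos-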

