One thing to check: have you increased --executor_registration_timeout
from the default of 1min? A docker pull can easily take longer than that.
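
For reference, this is roughly how I'd bump it (the zk:// URL and the
5mins value are placeholders for your setup, not taken from your logs):

    mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
        --containerizers=docker,mesos \
        --executor_registration_timeout=5mins

If you're on the Mesosphere packages, the flag can also go in a file under
/etc/mesos-slave/ (one file per flag), followed by a slave restart. Either
way, you can confirm the value the slave actually picked up from the flags
section of its /state.json (default slave port 5051):

    curl -s http://<slave>:5051/state.json | python -m json.tool \
        | grep executor_registration_timeout
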
On 2 October 2014 22:18, Michael Babineau <[email protected]> wrote:

> I'm seeing an issue where tasks are being marked as killed but remain
> running. The tasks all run via the native Docker containerizer and are
> started from Marathon.
>
> The net result is additional, orphaned Docker containers that must be
> stopped/removed manually.
>
> Versions:
> - Mesos 0.20.1
> - Marathon 0.7.1
> - Docker 1.2.0
> - Ubuntu 14.04
>
> Environment:
> - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances)
>   on EC2
>
> Here's the task in the Mesos UI:
>
> (note that stderr continues to update with the latest container output)
>
> Here's the still-running Docker container:
> $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
> 3d451b8213ea
> docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0
> "\"/bin/sh -c 'java   26 minutes ago   Up 26 minutes   9990/tcp
> mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>
> Here are the Mesos logs associated with the task:
> $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
> I1002 20:44:39.176024 1528 slave.cpp:1002] Got assigned task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:39.176257 1528 slave.cpp:1112] Launching task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:39.177287 1528 slave.cpp:1222] Queuing task
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> '20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:39.191769 1528 docker.cpp:743] Starting container
> '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework
> '20140919-224934-1593967114-5050-1518-0000'
> I1002 20:44:43.707033 1521 slave.cpp:1278] Asked to kill task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:43.707811 1521 slave.cpp:2088] Handling status update
> TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
> W1002 20:44:43.708273 1521 slave.cpp:1354] Killing the unregistered
> executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
> 20140919-224934-1593967114-5050-1518-0000 because it has no tasks
> E1002 20:44:43.708375 1521 slave.cpp:2205] Failed to update resources for
> container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for
> terminal task, destroying container: No container found
> I1002 20:44:43.708524 1521 status_update_manager.cpp:320] Received status
> update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:43.708709 1521 status_update_manager.cpp:373] Forwarding status
> update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
> I1002 20:44:43.728991 1526 status_update_manager.cpp:398] Received status
> update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:47:05.904324 1527 slave.cpp:2538] Monitoring executor
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
> '20140919-224934-1593967114-5050-1518-0000' in container
> '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
> I1002 20:47:06.311027 1525 slave.cpp:1733] Got registration for executor
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
> 20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
>
> I'll typically see a barrage of these in association with a Marathon app
> update (which deploys new tasks). Eventually, one container "sticks" and we
> get a RUNNING task instead of a KILLED one.
>
> Where else can I look?
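
In the meantime, the orphans can be cleared by hand. A rough sketch, using
the container from your output above (the mesos- prefix also covers
containers that are still healthy, so eyeball the list first and only stop
IDs you've confirmed are orphaned):

    # list every container the Docker containerizer has started
    docker ps | grep mesos-
    # stop and remove one confirmed orphan by its Mesos container name
    docker stop mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
    docker rm mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f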

