One thing to check - have you upped

--executor_registration_timeout

from the default of 1min? A docker pull can easily take longer than that, and the slave will kill the executor if it hasn't registered within the timeout.
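For example, something like this on each slave (the 5mins value is just a guess at something comfortably longer than your image pulls; if you're on the Mesosphere packages, the config-file form may be more convenient):

```shell
# Pass the flag directly when starting the slave...
mesos-slave --executor_registration_timeout=5mins --containerizers=docker,mesos

# ...or, with the Mesosphere deb/rpm packaging, drop the value into a
# per-flag config file and restart the slave:
echo '5mins' | sudo tee /etc/mesos-slave/executor_registration_timeout
sudo service mesos-slave restart
```

You can confirm the value took effect by grepping the slave's startup log line that prints its flags.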

On 2 October 2014 22:18, Michael Babineau <[email protected]> wrote:
> I'm seeing an issue where tasks are being marked as killed but remain
> running. The tasks all run via the native Docker containerizer and are
> started from Marathon.
>
> The net result is additional, orphaned Docker containers that must be
> stopped/removed manually.
>
> Versions:
> - Mesos 0.20.1
> - Marathon 0.7.1
> - Docker 1.2.0
> - Ubuntu 14.04
>
> Environment:
> - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances)
> on EC2
>
> Here's the task in the Mesos UI:
>
> (note that stderr continues to update with the latest container output)
>
> Here's the still-running Docker container:
> $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
> 3d451b8213ea
> docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0
> "\"/bin/sh -c 'java    26 minutes ago      Up 26 minutes       9990/tcp
> mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>
> Here are the Mesos logs associated with the task:
> $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
> I1002 20:44:39.176024  1528 slave.cpp:1002] Got assigned task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:39.176257  1528 slave.cpp:1112] Launching task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:39.177287  1528 slave.cpp:1222] Queuing task
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> '20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:39.191769  1528 docker.cpp:743] Starting container
> '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework
> '20140919-224934-1593967114-5050-1518-0000'
> I1002 20:44:43.707033  1521 slave.cpp:1278] Asked to kill task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:43.707811  1521 slave.cpp:2088] Handling status update
> TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
> W1002 20:44:43.708273  1521 slave.cpp:1354] Killing the unregistered
> executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
> 20140919-224934-1593967114-5050-1518-0000 because it has no tasks
> E1002 20:44:43.708375  1521 slave.cpp:2205] Failed to update resources for
> container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for
> terminal task, destroying container: No container found
> I1002 20:44:43.708524  1521 status_update_manager.cpp:320] Received status
> update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:44:43.708709  1521 status_update_manager.cpp:373] Forwarding status
> update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
> I1002 20:44:43.728991  1526 status_update_manager.cpp:398] Received status
> update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
> serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
> 20140919-224934-1593967114-5050-1518-0000
> I1002 20:47:05.904324  1527 slave.cpp:2538] Monitoring executor
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
> '20140919-224934-1593967114-5050-1518-0000' in container
> '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
> I1002 20:47:06.311027  1525 slave.cpp:1733] Got registration for executor
> 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
> 20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
>
> I'll typically see a barrage of these in association with a Marathon app
> update (which deploys new tasks). Eventually, one container "sticks" and we
> get a RUNNING task instead of a KILLED one.
>
> Where else can I look?