I'm seeing an issue where tasks are marked as KILLED in Mesos but their
Docker containers remain running. The tasks are all started from Marathon
and run via the native Docker containerizer.

The net result is additional, orphaned Docker containers that must be
stopped/removed manually.
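For the record, the manual cleanup is nothing more than stopping and
removing the stray container by name (the containers are named
mesos-<Mesos container ID>), e.g. for the container shown below:

$ docker stop mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
$ docker rm mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f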

Versions:
- Mesos 0.20.1
- Marathon 0.7.1
- Docker 1.2.0
- Ubuntu 14.04

Environment:
- 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances)
on EC2

Here's the task in the Mesos UI:
[inline screenshot of the task in the Mesos UI]
(note that stderr continues to update with the latest container output)
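That stderr lives in the executor sandbox on the slave; assuming the
default --work_dir of /tmp/mesos, something like this locates it for
tailing:

$ find /tmp/mesos/slaves -path '*1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f*' -name stderr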

Here's the still-running Docker container:
$ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
3d451b8213ea        docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0        "\"/bin/sh -c 'java    26 minutes ago      Up 26 minutes       9990/tcp            mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
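As a quick check that Docker itself still considers the container running
(and that this isn't just a stale listing), I can inspect it by name:

$ docker inspect --format '{{.State.Running}} {{.State.Pid}}' mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f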

Here are the Mesos logs associated with the task:
$ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
I1002 20:44:39.176024  1528 slave.cpp:1002] Got assigned task
serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.176257  1528 slave.cpp:1112] Launching task
serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework
20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.177287  1528 slave.cpp:1222] Queuing task
'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor
serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
'20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.191769  1528 docker.cpp:743] Starting container
'1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task
'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor
'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework
'20140919-224934-1593967114-5050-1518-0000'
I1002 20:44:43.707033  1521 slave.cpp:1278] Asked to kill task
serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
20140919-224934-1593967114-5050-1518-0000
I1002 20:44:43.707811  1521 slave.cpp:2088] Handling status update
TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
W1002 20:44:43.708273  1521 slave.cpp:1354] Killing the unregistered
executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
20140919-224934-1593967114-5050-1518-0000 because it has no tasks
E1002 20:44:43.708375  1521 slave.cpp:2205] Failed to update resources for
container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor
serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task
serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for
terminal task, destroying container: No container found
I1002 20:44:43.708524  1521 status_update_manager.cpp:320] Received status
update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task
serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
20140919-224934-1593967114-5050-1518-0000
I1002 20:44:43.708709  1521 status_update_manager.cpp:373] Forwarding
status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for
task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
I1002 20:44:43.728991  1526 status_update_manager.cpp:398] Received status
update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for
task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework
20140919-224934-1593967114-5050-1518-0000
I1002 20:47:05.904324  1527 slave.cpp:2538] Monitoring executor
'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
'20140919-224934-1593967114-5050-1518-0000' in container
'1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
I1002 20:47:06.311027  1525 slave.cpp:1733] Got registration for executor
'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework
20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
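Grepping the same log by container ID rather than task ID should also pick
up the Docker containerizer's own messages, if that's useful:

$ grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f /var/log/mesos/mesos-slave.INFO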

I'll typically see a barrage of these killed-but-still-running tasks during
a Marathon app update (which deploys new tasks). Eventually one container
"sticks" and we get a RUNNING task instead of a KILLED one.
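For cross-checking, Marathon's view of the app can be compared against
docker ps on each slave; assuming the app ID is "serialization" and
Marathon's default port of 8080:

$ curl -s http://<marathon-host>:8080/v2/apps/serialization/tasks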

Where else can I look?
