I'm seeing an issue where tasks are marked as KILLED but their Docker
containers keep running. The tasks all run via the native Docker
containerizer and are started from Marathon.
The net result is additional, orphaned Docker containers that must be
stopped/removed manually.
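For now that cleanup is by hand, roughly like this for each orphan (a
sketch, not a script: it keys off the mesos-<containerId> name the Docker
containerizer assigns, and <container-id> stands in for the container ID
taken from the slave log):

$ docker ps | grep mesos-<container-id> | awk '{print $1}' | xargs -r docker stop
$ docker ps -a | grep mesos-<container-id> | awk '{print $1}' | xargs -r docker rm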
Versions:
- Mesos 0.20.1
- Marathon 0.7.1
- Docker 1.2.0
- Ubuntu 14.04
Environment:
- 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances) on EC2
Here's the task in the Mesos UI:
[inline screenshot: the task as shown in the Mesos UI]
(note that stderr continues to update with the latest container output)
Here's the still-running Docker container:
$ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
3d451b8213ea  docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0  "\"/bin/sh -c 'java   26 minutes ago   Up 26 minutes   9990/tcp   mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
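If a dump of the slave's own view is useful, I can also pull its state
endpoint and look for the container there, along these lines (assuming the
default slave port of 5051):

$ curl -s http://10.2.1.34:5051/state.json | python -m json.tool | grep -C 2 1d337fa3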
Here are the Mesos logs associated with the task:
$ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
I1002 20:44:39.176024 1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.176257 1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.177287 1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518-0000'
I1002 20:44:39.191769 1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-0000'
I1002 20:44:43.707033 1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:43.707811 1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
W1002 20:44:43.708273 1521 slave.cpp:1354] Killing the unregistered executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 because it has no tasks
E1002 20:44:43.708375 1521 slave.cpp:2205] Failed to update resources for container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for terminal task, destroying container: No container found
I1002 20:44:43.708524 1521 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:43.708709 1521 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
I1002 20:44:43.728991 1526 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:47:05.904324 1527 slave.cpp:2538] Monitoring executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework '20140919-224934-1593967114-5050-1518-0000' in container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
I1002 20:47:06.311027 1525 slave.cpp:1733] Got registration for executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
I'll typically see a barrage of these whenever a Marathon app update
deploys new tasks. Eventually one container "sticks" and we get a RUNNING
task instead of a KILLED one.
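For reference, the updates are plain calls to Marathon's REST API,
something like the call below (host and payload are illustrative rather
than the exact values we use; "serialization" is the app id behind the task
names above):

$ curl -X PUT -H 'Content-Type: application/json' \
    http://<marathon-host>:8080/v2/apps/serialization \
    -d '{"instances": 3}'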
Where else can I look?