Supporting the registration timeout theory, logs for this example container confirm it didn't actually start until several minutes after Mesos had marked the task as killed.
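In case it's useful to anyone else, here is the rough (and untested) set of checks I've been using; the process name mesos-slave, the slave HTTP port 5051, and having python around for pretty-printing are assumptions about my setup, so adjust as needed:

# 1. Did MESOS_EXECUTOR_REGISTRATION_TIMEOUT reach the slave process at all?
$ sudo cat /proc/$(pgrep -f mesos-slave | head -n1)/environ | tr '\0' '\n' | grep MESOS_EXECUTOR_REGISTRATION_TIMEOUT

# 2. What does the slave itself report (only if this Mesos version exposes its flags in state.json)?
$ curl -s http://localhost:5051/state.json | python -m json.tool | grep executor_registration_timeout

# 3. When did Docker actually start the container, to compare against the kill time in the slave log?
$ docker inspect --format '{{.State.StartedAt}}' mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f

If the variable shows up in the slave's environment and the flag reads 5mins rather than 1mins, the setting took; comparing Docker's StartedAt against the "Asked to kill task" timestamp in the log below shows how far past the timeout the pull ran.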
On Thu, Oct 2, 2014 at 2:29 PM, Michael Babineau <[email protected]> wrote:
> Thanks, I just had the same thought
>
> I'm injecting it via environment variable:
> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins
>
> but I don't know how to check that the setting took
>
>
> On Thu, Oct 2, 2014 at 2:24 PM, Dick Davies <[email protected]> wrote:
>
>> One thing to check - have you upped
>>
>> --executor_registration_timeout
>>
>> from the default of 1min? a docker pull can easily take longer than that.
>>
>> On 2 October 2014 22:18, Michael Babineau <[email protected]> wrote:
>> > I'm seeing an issue where tasks are being marked as killed but remain running. The tasks all run via the native Docker containerizer and are started from Marathon.
>> >
>> > The net result is additional, orphaned Docker containers that must be stopped/removed manually.
>> >
>> > Versions:
>> > - Mesos 0.20.1
>> > - Marathon 0.7.1
>> > - Docker 1.2.0
>> > - Ubuntu 14.04
>> >
>> > Environment:
>> > - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances) on EC2
>> >
>> > Here's the task in the Mesos UI:
>> >
>> > (note that stderr continues to update with the latest container output)
>> >
>> > Here's the still-running Docker container:
>> > $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>> > 3d451b8213ea  docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0  "\"/bin/sh -c 'java  26 minutes ago  Up 26 minutes  9990/tcp  mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>> >
>> > Here are the Mesos logs associated with the task:
>> > $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
>> > I1002 20:44:39.176024 1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
>> > I1002 20:44:39.176257 1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
>> > I1002 20:44:39.177287 1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518-0000
>> > I1002 20:44:39.191769 1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-0000'
>> > I1002 20:44:43.707033 1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>> > I1002 20:44:43.707811 1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
>> > W1002 20:44:43.708273 1521 slave.cpp:1354] Killing the unregistered executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 because it has no tasks
>> > E1002 20:44:43.708375 1521 slave.cpp:2205] Failed to update resources for container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for terminal task, destroying container: No container found
>> > I1002 20:44:43.708524 1521 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>> > I1002 20:44:43.708709 1521 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
>> > I1002 20:44:43.728991 1526 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>> > I1002 20:47:05.904324 1527 slave.cpp:2538] Monitoring executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework '20140919-224934-1593967114-5050-1518-0000' in container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
>> > I1002 20:47:06.311027 1525 slave.cpp:1733] Got registration for executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
>> >
>> > I'll typically see a barrage of these in association with a Marathon app update (which deploys new tasks). Eventually, one container "sticks" and we get a RUNNING task instead of a KILLED one.
>> >
>> > Where else can I look?
>> >
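Until there's a proper fix we're stopping and removing the orphans by hand, roughly along these lines (the mesos-<container-id> name is the one docker ps shows above; healthy tasks get the same naming scheme, so double-check a container really is orphaned before touching it):

$ docker stop mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f   # stop the orphaned container
$ docker rm mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f     # then remove it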

