Okay, I don't think the issue is with the executor registration timeout. The timeout parameter is being passed correctly, and only about 4.5 seconds elapse between task start and task kill:

I1002 *20:44:39.176024* 1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.176257 1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.177287 1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.191769 1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-0000'
I1002 20:44:43.707033 1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
I1002 *20:44:43.707811* 1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
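(On the earlier question of how to check that the setting took: one rough way is to read the flag back from the slave itself. The snippet below is only a sketch; it assumes the stock slave HTTP port 5051, that the slave's /state.json response includes a "flags" map containing executor_registration_timeout, and it uses the slave address from the log line above as a placeholder.)

    import json
    import urllib.request

    # Placeholder: the slave that launched the task (address taken from the
    # logs above); 5051 is the default slave HTTP port, adjust if yours differs.
    SLAVE_URL = "http://10.2.1.34:5051/state.json"

    with urllib.request.urlopen(SLAVE_URL) as resp:
        state = json.load(resp)

    # Assumption: state.json exposes the slave's effective configuration under
    # "flags", including a timeout passed via MESOS_EXECUTOR_REGISTRATION_TIMEOUT.
    flags = state.get("flags", {})
    print("executor_registration_timeout =",
          flags.get("executor_registration_timeout", "<not present>"))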
What else could this be?

On Thu, Oct 2, 2014 at 2:33 PM, Michael Babineau <[email protected]> wrote:

> Supporting the registration timeout theory, logs for this example container confirm it didn't actually start until several minutes after Mesos had marked the task as killed.
>
> On Thu, Oct 2, 2014 at 2:29 PM, Michael Babineau <[email protected]> wrote:
>
>> Thanks, I just had the same thought
>>
>> I'm injecting it via environment variable:
>> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins
>>
>> but I don't know how to check that the setting took
>>
>> On Thu, Oct 2, 2014 at 2:24 PM, Dick Davies <[email protected]> wrote:
>>
>>> One thing to check - have you upped
>>>
>>> --executor_registration_timeout
>>>
>>> from the default of 1min? a docker pull can easily take longer than that.
>>>
>>> On 2 October 2014 22:18, Michael Babineau <[email protected]> wrote:
>>> > I'm seeing an issue where tasks are being marked as killed but remain running. The tasks all run via the native Docker containerizer and are started from Marathon.
>>> >
>>> > The net result is additional, orphaned Docker containers that must be stopped/removed manually.
>>> >
>>> > Versions:
>>> > - Mesos 0.20.1
>>> > - Marathon 0.7.1
>>> > - Docker 1.2.0
>>> > - Ubuntu 14.04
>>> >
>>> > Environment:
>>> > - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances) on EC2
>>> >
>>> > Here's the task in the Mesos UI:
>>> >
>>> > (note that stderr continues to update with the latest container output)
>>> >
>>> > Here's the still-running Docker container:
>>> > $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>>> > 3d451b8213ea  docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0  "\"/bin/sh -c 'java   26 minutes ago  Up 26 minutes  9990/tcp  mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>>> >
>>> > Here are the Mesos logs associated with the task:
>>> > $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
>>> > I1002 20:44:39.176024 1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:39.176257 1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:39.177287 1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:39.191769 1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-0000'
>>> > I1002 20:44:43.707033 1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:43.707811 1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
>>> > W1002 20:44:43.708273 1521 slave.cpp:1354] Killing the unregistered executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 because it has no tasks
>>> > E1002 20:44:43.708375 1521 slave.cpp:2205] Failed to update resources for container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for terminal task, destroying container: No container found
>>> > I1002 20:44:43.708524 1521 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:43.708709 1521 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
>>> > I1002 20:44:43.728991 1526 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:47:05.904324 1527 slave.cpp:2538] Monitoring executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework '20140919-224934-1593967114-5050-1518-0000' in container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
>>> > I1002 20:47:06.311027 1525 slave.cpp:1733] Got registration for executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
>>> >
>>> > I'll typically see a barrage of these in association with a Marathon app update (which deploys new tasks). Eventually, one container "sticks" and we get a RUNNING task instead of a KILLED one.
>>> >
>>> > Where else can I look?
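For the manual cleanup mentioned in the original report, something along the following lines could reconcile running "mesos-" containers against what the slave still considers live. It is a sketch under stated assumptions: the default slave HTTP port 5051, a "container" field on each executor entry in /state.json, and `docker inspect --format` being available; the slave address is a placeholder, and it is worth dry-running (print only) before trusting the `docker rm -f`.

    import json
    import subprocess
    import urllib.request

    # Placeholder slave address; 5051 is the default slave HTTP port.
    SLAVE_URL = "http://10.2.1.34:5051/state.json"

    with urllib.request.urlopen(SLAVE_URL) as resp:
        state = json.load(resp)

    # Assumption: each executor entry in state.json carries the Mesos container
    # ID under "container"; these are the containers the slave still tracks.
    active = {
        executor.get("container")
        for framework in state.get("frameworks", [])
        for executor in framework.get("executors", [])
    }

    # The Docker containerizer names its containers "mesos-<container-id>" (as in
    # the docker ps output above), so anything matching that pattern but unknown
    # to the slave is an orphan.
    for cid in subprocess.check_output(["docker", "ps", "-q"], text=True).split():
        name = subprocess.check_output(
            ["docker", "inspect", "--format", "{{.Name}}", cid], text=True
        ).strip().lstrip("/")
        if not name.startswith("mesos-"):
            continue  # not started by the Docker containerizer
        if name[len("mesos-"):] not in active:
            print("removing orphaned container", name)
            subprocess.check_call(["docker", "rm", "-f", cid])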

