Okay, I don't think the issue is with the executor registration timeout.
The timeout parameter is being passed correctly, and there is only a
4-second delay between task start and task kill:
I1002 *20:44:39.176024*  1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.176257  1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.177287  1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518-0000
I1002 20:44:39.191769  1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-0000'
I1002 20:44:43.707033  1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
I1002 *20:44:43.707811*  1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0

What else could this be?
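
One way to double-check the effective value is to pull the slave's flags from its state endpoint (a sketch: it assumes state.json exposes a "flags" section, and it guesses the slave address from the executor registration line in the logs below plus the default slave port 5051):

$ curl -s http://10.2.1.34:5051/state.json | python -m json.tool | grep executor_registration_timeout

If the environment variable took, that should report 5mins rather than the default.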

On Thu, Oct 2, 2014 at 2:33 PM, Michael Babineau <[email protected]> wrote:

> Supporting the registration timeout theory, logs for this example
> container confirm it didn't actually start until several minutes after
> Mesos had marked the task as killed.
>
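>
(A quick way to verify that start-time gap, as a sketch: assuming Docker 1.2's inspect supports the Go template --format flag, this reports when the container actually started, using the container name from the logs below.)

$ docker inspect --format '{{.State.StartedAt}}' mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
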
> On Thu, Oct 2, 2014 at 2:29 PM, Michael Babineau <[email protected]> wrote:
>
>> Thanks, I just had the same thought.
>>
>> I'm injecting it via environment variable:
>> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins
>>
>> but I don't know how to check whether the setting took effect
>>
>>
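>>
(A sketch of one check, assuming a single mesos-slave process on the box: read its environment straight out of /proc to confirm the variable reached it.)

$ sudo cat /proc/$(pgrep -n mesos-slave)/environ | tr '\0' '\n' | grep MESOS_

The state.json flags check further up is the more direct test, though, since it shows the value the slave actually parsed.
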
>> On Thu, Oct 2, 2014 at 2:24 PM, Dick Davies <[email protected]> wrote:
>>
>>> One thing to check: have you upped
>>>
>>> --executor_registration_timeout
>>>
>>> from the default of 1min? A docker pull can easily take longer than that.
>>>
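>>>
(For reference, the usual ways to raise it, as a sketch; the flag-file path assumes the Mesosphere packages, and the slave needs a restart to pick up the change.)

# either pass the flag directly on the mesos-slave command line:
#   --executor_registration_timeout=5mins
# or, with the Mesosphere packages, drop it in a flag file and restart:
$ echo '5mins' | sudo tee /etc/mesos-slave/executor_registration_timeout
$ sudo service mesos-slave restart
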
>>> On 2 October 2014 22:18, Michael Babineau <[email protected]> wrote:
>>> > I'm seeing an issue where tasks are being marked as killed but remain
>>> > running. The tasks all run via the native Docker containerizer and are
>>> > started from Marathon.
>>> >
>>> > The net result is additional, orphaned Docker containers that must be
>>> > stopped/removed manually.
>>> >
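>>> >
(Until the root cause is sorted out, the manual cleanup is just a docker stop/rm of each container confirmed orphaned; a sketch, using the container name from the logs further down:)

$ docker stop mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
$ docker rm mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f

Worth double-checking against the slave's active executors first, since every container started by the Docker containerizer carries the mesos- name prefix, orphaned or not.
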
>>> > Versions:
>>> > - Mesos 0.20.1
>>> > - Marathon 0.7.1
>>> > - Docker 1.2.0
>>> > - Ubuntu 14.04
>>> >
>>> > Environment:
>>> > - 3 ZK nodes, 3 Mesos Masters, and 3 Mesos Slaves (all separate instances) on EC2
>>> >
>>> > Here's the task in the Mesos UI:
>>> >
>>> > (note that stderr continues to update with the latest container output)
>>> >
>>> > Here's the still-running Docker container:
>>> > $ docker ps|grep 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>>> > 3d451b8213ea        docker.thefactory.com/ace-serialization:f7aa1d4f46f72d52f5a20ef7ae8680e4acf88bc0   "\"/bin/sh -c 'java    26 minutes ago      Up 26 minutes       9990/tcp   mesos-1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f
>>> >
>>> > Here are the Mesos logs associated with the task:
>>> > $ grep eda431d7-4a74-11e4-a320-56847afe9799 /var/log/mesos/mesos-slave.INFO
>>> > I1002 20:44:39.176024  1528 slave.cpp:1002] Got assigned task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:39.176257  1528 slave.cpp:1112] Launching task serialization.eda431d7-4a74-11e4-a320-56847afe9799 for framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:39.177287  1528 slave.cpp:1222] Queuing task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' for executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework '20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:39.191769  1528 docker.cpp:743] Starting container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f' for task 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' (and executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799') of framework '20140919-224934-1593967114-5050-1518-0000'
>>> > I1002 20:44:43.707033  1521 slave.cpp:1278] Asked to kill task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:43.707811  1521 slave.cpp:2088] Handling status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 from @0.0.0.0:0
>>> > W1002 20:44:43.708273 1521 slave.cpp:1354] Killing the unregistered executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 because it has no tasks
>>> > E1002 20:44:43.708375  1521 slave.cpp:2205] Failed to update resources for container 1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f of executor serialization.eda431d7-4a74-11e4-a320-56847afe9799 running task serialization.eda431d7-4a74-11e4-a320-56847afe9799 on status update for terminal task, destroying container: No container found
>>> > I1002 20:44:43.708524  1521 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:44:43.708709  1521 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000 to [email protected]:5050
>>> > I1002 20:44:43.728991  1526 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 4f5bd9f9-0625-43de-81f6-2c3423b1ce12) for task serialization.eda431d7-4a74-11e4-a320-56847afe9799 of framework 20140919-224934-1593967114-5050-1518-0000
>>> > I1002 20:47:05.904324  1527 slave.cpp:2538] Monitoring executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework '20140919-224934-1593967114-5050-1518-0000' in container '1d337fa3-8dd3-4b43-9d1e-a774cbcbc22f'
>>> > I1002 20:47:06.311027  1525 slave.cpp:1733] Got registration for executor 'serialization.eda431d7-4a74-11e4-a320-56847afe9799' of framework 20140919-224934-1593967114-5050-1518-0000 from executor(1)@10.2.1.34:29920
>>> >
>>> > I'll typically see a barrage of these in association with a Marathon app
>>> > update (which deploys new tasks). Eventually, one container "sticks" and
>>> > we get a RUNNING task instead of a KILLED one.
>>> >
>>> > Where else can I look?
>>>
>>
>>
>
