The case where Mesos loses track of these killed containers is going to be fixed soon; I have a ReviewBoard request up, and once it's merged we shouldn't have any untracked containers.
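Until that lands, a rough way to spot those leftover containers on a slave is to look for the Docker containerizer's "mesos-" container name prefix (a quick manual check only, and the prefix is an assumption worth verifying on your version):

# list all containers Docker knows about that look like Mesos-launched ones
docker ps -a | grep mesos-

# then stop and remove any orphans by hand (IDs below are placeholders)
docker stop <container-id> && docker rm <container-id>
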
Tim

On Fri, Oct 17, 2014 at 3:14 PM, Dick Davies <[email protected]> wrote:
> Good catch! Sorry, the docs are right, I just had a brain fart :)
>
> On 17 October 2014 13:46, Nils De Moor <[email protected]> wrote:
> > Hi guys,
> >
> > Thanks for the swift feedback. I can confirm that tweaking the
> > task_launch_timeout setting in Marathon and setting it to a value bigger
> > than the executor_registration_timeout setting in Mesos fixed our problem.
> >
> > One side note, though: the task_launch_timeout setting is in milliseconds,
> > so for 5 minutes it's 300000 (vs 300 in seconds). Knowing that will save
> > you some hair pulling when you see your tasks being killed immediately
> > after being launched. ;)
> >
> > Thanks again!
> >
> > Kr,
> > Nils
> >
> > On Thu, Oct 16, 2014 at 4:27 PM, Michael Babineau
> > <[email protected]> wrote:
> >> See also https://issues.apache.org/jira/browse/MESOS-1915
> >>
> >> On Thu, Oct 16, 2014 at 2:59 AM, Dick Davies <[email protected]> wrote:
> >>> One gotcha - the Marathon timeout is in seconds, so pass '300' in your
> >>> case.
> >>>
> >>> Let us know if it works; I spotted this the other day and anecdotally it
> >>> addresses the issue for some users, so it would be good to get more
> >>> feedback.
> >>>
> >>> On 16 October 2014 09:49, Grzegorz Graczyk <[email protected]> wrote:
> >>> > Make sure you have --task_launch_timeout in Marathon set to the same
> >>> > value as executor_registration_timeout.
> >>> >
> >>> > https://github.com/mesosphere/marathon/blob/master/docs/docs/native-docker.md#configure-marathon
> >>> >
> >>> > On 16 October 2014 10:37, Nils De Moor <[email protected]> wrote:
> >>> >> Hi,
> >>> >>
> >>> >> Environment:
> >>> >> - Clean Vagrant install, 1 master, 1 slave (same behaviour on a
> >>> >>   production cluster with 3 masters, 6 slaves)
> >>> >> - Mesos 0.20.1
> >>> >> - Marathon 0.7.3
> >>> >> - Docker 1.2.0
> >>> >>
> >>> >> Slave config:
> >>> >> - containerizers: "docker,mesos"
> >>> >> - executor_registration_timeout: 5mins
> >>> >>
> >>> >> When I start Docker container tasks, the images start being pulled
> >>> >> from the hub, but after 1 minute Mesos kills the tasks. In the
> >>> >> background, though, the pull keeps running, and once everything has
> >>> >> been pulled the Docker container is started, without Mesos knowing
> >>> >> about it. When I start the same task in Mesos again (after I know the
> >>> >> pull of the image is done), it runs normally.
> >>> >>
> >>> >> So this leaves slaves with 'dirty' Docker containers, as Mesos has no
> >>> >> knowledge of them.
> >>> >>
> >>> >> From the logs I get this:
> >>> >> ---
> >>> >> I1009 15:30:02.990291 1414 slave.cpp:1002] Got assigned task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:30:02.990979 1414 slave.cpp:1112] Launching task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:30:02.993341 1414 slave.cpp:1222] Queuing task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' for executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework '20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:30:02.995818 1409 docker.cpp:743] Starting container '25ac3310-71e4-4d10-8a4b-38add4537308' for task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' (and executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a') of framework '20140904-160348-185204746-5050-27588-0000'
> >>> >>
> >>> >> I1009 15:31:07.033287 1413 slave.cpp:1278] Asked to kill task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:31:07.034742 1413 slave.cpp:2088] Handling status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 from @0.0.0.0:0
> >>> >> W1009 15:31:07.034881 1413 slave.cpp:1354] Killing the unregistered executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a' of framework 20140904-160348-185204746-5050-27588-0000 because it has no tasks
> >>> >> E1009 15:31:07.034945 1413 slave.cpp:2205] Failed to update resources for container 25ac3310-71e4-4d10-8a4b-38add4537308 of executor test-app.23755452-4fc9-11e4-839b-080027c4337a running task test-app.23755452-4fc9-11e4-839b-080027c4337a on status update for terminal task, destroying container: No container found
> >>> >> I1009 15:31:07.035133 1413 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:31:07.035210 1413 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 to [email protected]:5050
> >>> >> I1009 15:31:07.046167 1408 status_update_manager.cpp:398] Received status update acknowledgement (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
> >>> >>
> >>> >> I1009 15:35:02.993736 1414 slave.cpp:3010] Terminating executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 because it did not register within 5mins
> >>> >> ---
> >>> >>
> >>> >> I already posted my question on the marathon board, as I first thought
> >>> >> it was an issue on marathon's end:
> >>> >> https://groups.google.com/forum/#!topic/marathon-framework/NT7_YIZnNoY
> >>> >>
> >>> >> Kind regards,
> >>> >> Nils
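
For reference, this is roughly how the two timeouts discussed above line up; the invocations and the ZooKeeper URL below are placeholders, and only the flag names, units and values come from this thread:

# Mesos slave: allow up to 5 minutes for the executor (and its Docker pull)
# to register before the slave gives up
mesos-slave \
  --master=zk://zk-host:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins

# Marathon 0.7.x: --task_launch_timeout is in milliseconds, so match the
# 5 minutes above with 300000 (how Marathon is started depends on your install)
marathon --master zk://zk-host:2181/mesos --task_launch_timeout 300000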

