The case where Mesos loses track of these killed containers is going to be fixed soon; I have a ReviewBoard request up, and once it's merged we shouldn't have any untracked containers.
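Until that lands, a rough way to spot those leftover containers on a slave is to look for the Docker containerizer's "mesos-" container name prefix (a quick manual check only, and the prefix is an assumption worth verifying on your version):

# list all containers Docker knows about that look like Mesos-launched ones
docker ps -a | grep mesos-

# then stop and remove any orphans by hand (IDs below are placeholders)
docker stop <container-id> && docker rm <container-id>
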
Tim

On Fri, Oct 17, 2014 at 3:14 PM, Dick Davies <[email protected]> wrote:
> Good catch! Sorry, the docs are right, I just had a brain fart :)
>
> On 17 October 2014 13:46, Nils De Moor <[email protected]> wrote:
> > Hi guys,
> >
> > Thanks for the swift feedback. I can confirm that tweaking the
> > task_launch_timeout setting in Marathon and setting it to a value bigger
> > than the executor_registration_timeout setting in Mesos fixed our problem.
> >
> > One side note, though: the task_launch_timeout setting is in milliseconds,
> > so for 5 minutes it's 300000 (vs 300 in seconds). Knowing that will save
> > you some hair pulling when you see your tasks being killed immediately
> > after being launched. ;)
> >
> > Thanks again!
> >
> > Kr,
> > Nils
> >
> > On Thu, Oct 16, 2014 at 4:27 PM, Michael Babineau
> > <[email protected]> wrote:
> >> See also https://issues.apache.org/jira/browse/MESOS-1915
> >>
> >> On Thu, Oct 16, 2014 at 2:59 AM, Dick Davies <[email protected]> wrote:
> >>> One gotcha - the Marathon timeout is in seconds, so pass '300' in your
> >>> case.
> >>>
> >>> Let us know if it works; I spotted this the other day and anecdotally it
> >>> addresses the issue for some users, so it would be good to get more
> >>> feedback.
> >>>
> >>> On 16 October 2014 09:49, Grzegorz Graczyk <[email protected]> wrote:
> >>> > Make sure you have --task_launch_timeout in Marathon set to the same
> >>> > value as executor_registration_timeout.
> >>> >
> >>> > https://github.com/mesosphere/marathon/blob/master/docs/docs/native-docker.md#configure-marathon
> >>> >
> >>> > On 16 October 2014 10:37, Nils De Moor <[email protected]> wrote:
> >>> >> Hi,
> >>> >>
> >>> >> Environment:
> >>> >> - Clean Vagrant install, 1 master, 1 slave (same behaviour on a
> >>> >>   production cluster with 3 masters, 6 slaves)
> >>> >> - Mesos 0.20.1
> >>> >> - Marathon 0.7.3
> >>> >> - Docker 1.2.0
> >>> >>
> >>> >> Slave config:
> >>> >> - containerizers: "docker,mesos"
> >>> >> - executor_registration_timeout: 5mins
> >>> >>
> >>> >> When I start Docker container tasks, the images start being pulled
> >>> >> from the hub, but after 1 minute Mesos kills the tasks. In the
> >>> >> background, though, the pull keeps running, and once everything has
> >>> >> been pulled the Docker container is started, without Mesos knowing
> >>> >> about it. When I start the same task in Mesos again (after I know the
> >>> >> pull of the image is done), it runs normally.
> >>> >>
> >>> >> So this leaves slaves with 'dirty' Docker containers, as Mesos has no
> >>> >> knowledge of them.
> >>> >>
> >>> >> From the logs I get this:
> >>> >> ---
> >>> >> I1009 15:30:02.990291 1414 slave.cpp:1002] Got assigned task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:30:02.990979 1414 slave.cpp:1112] Launching task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:30:02.993341 1414 slave.cpp:1222] Queuing task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' for executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework '20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:30:02.995818 1409 docker.cpp:743] Starting container '25ac3310-71e4-4d10-8a4b-38add4537308' for task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' (and executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a') of framework '20140904-160348-185204746-5050-27588-0000'
> >>> >>
> >>> >> I1009 15:31:07.033287 1413 slave.cpp:1278] Asked to kill task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:31:07.034742 1413 slave.cpp:2088] Handling status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 from @0.0.0.0:0
> >>> >> W1009 15:31:07.034881 1413 slave.cpp:1354] Killing the unregistered executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a' of framework 20140904-160348-185204746-5050-27588-0000 because it has no tasks
> >>> >> E1009 15:31:07.034945 1413 slave.cpp:2205] Failed to update resources for container 25ac3310-71e4-4d10-8a4b-38add4537308 of executor test-app.23755452-4fc9-11e4-839b-080027c4337a running task test-app.23755452-4fc9-11e4-839b-080027c4337a on status update for terminal task, destroying container: No container found
> >>> >> I1009 15:31:07.035133 1413 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
> >>> >> I1009 15:31:07.035210 1413 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 to [email protected]:5050
> >>> >> I1009 15:31:07.046167 1408 status_update_manager.cpp:398] Received status update acknowledgement (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
> >>> >>
> >>> >> I1009 15:35:02.993736 1414 slave.cpp:3010] Terminating executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 because it did not register within 5mins
> >>> >> ---
> >>> >>
> >>> >> I already posted my question on the marathon board, as I first thought
> >>> >> it was an issue on marathon's end:
> >>> >> https://groups.google.com/forum/#!topic/marathon-framework/NT7_YIZnNoY
> >>> >>
> >>> >> Kind regards,
> >>> >> Nils
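
For reference, this is roughly how the two timeouts discussed above line up; the invocations and the ZooKeeper URL below are placeholders, and only the flag names, units and values come from this thread:

# Mesos slave: allow up to 5 minutes for the executor (and its Docker pull)
# to register before the slave gives up
mesos-slave \
  --master=zk://zk-host:2181/mesos \
  --containerizers=docker,mesos \
  --executor_registration_timeout=5mins

# Marathon 0.7.x: --task_launch_timeout is in milliseconds, so match the
# 5 minutes above with 300000 (how Marathon is started depends on your install)
marathon --master zk://zk-host:2181/mesos --task_launch_timeout 300000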

