Hi guys,

Thanks for the swift feedback. I can confirm that tweaking the task_launch_timeout setting in marathon and setting it to a value bigger than the executor_registration_timeout setting in mesos fixed our problem.
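For anyone hitting the same thing, the combination that works for us boils down to something like this (command-line form for illustration only; the same flags can of course be set through whatever config mechanism your deployment uses):

---
# Mesos slave: give the executor 5 minutes to register, enough for the docker pull
mesos-slave --containerizers=docker,mesos --executor_registration_timeout=5mins ...

# Marathon: allow the task at least as long before it gets killed (value in milliseconds)
marathon --task_launch_timeout 300000 ...
---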
One sidenote though: the task_launch_timeout setting is in milliseconds, so for 5 minutes it's 300000 (vs 300 in seconds). It will save you some hair pulling when seeing your tasks being killed immediately after being launched. ;)

Thanks again!

Kr,
Nils

On Thu, Oct 16, 2014 at 4:27 PM, Michael Babineau <[email protected]> wrote:

> See also https://issues.apache.org/jira/browse/MESOS-1915
>
> On Thu, Oct 16, 2014 at 2:59 AM, Dick Davies <[email protected]> wrote:
>
>> One gotcha - the marathon timeout is in seconds, so pass '300' in your case.
>>
>> Let us know if it works; I spotted this the other day and anecdotally it
>> addresses the issue for some users, be good to get more feedback.
>>
>> On 16 October 2014 09:49, Grzegorz Graczyk <[email protected]> wrote:
>> > Make sure you have --task_launch_timeout in marathon set to the same value
>> > as executor_registration_timeout.
>> > https://github.com/mesosphere/marathon/blob/master/docs/docs/native-docker.md#configure-marathon
>> >
>> > On 16 October 2014 10:37, Nils De Moor <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Environment:
>> >> - Clean vagrant install, 1 master, 1 slave (same behaviour on production
>> >>   cluster with 3 masters, 6 slaves)
>> >> - Mesos 0.20.1
>> >> - Marathon 0.7.3
>> >> - Docker 1.2.0
>> >>
>> >> Slave config:
>> >> - containerizers: "docker,mesos"
>> >> - executor_registration_timeout: 5mins
>> >>
>> >> When I start docker container tasks, they start being pulled from the HUB,
>> >> but after 1 minute mesos kills them.
>> >> In the background the pull keeps running, and once everything is pulled in
>> >> the docker container is started, without mesos knowing about it.
>> >> When I start the same task in mesos again (after I know the pull of the
>> >> image is done), it runs normally.
>> >>
>> >> So this leaves slaves with 'dirty' docker containers, as mesos has no
>> >> knowledge about them.
>> >>
>> >> From the logs I get this:
>> >> ---
>> >> I1009 15:30:02.990291 1414 slave.cpp:1002] Got assigned task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588-0000
>> >> I1009 15:30:02.990979 1414 slave.cpp:1112] Launching task test-app.23755452-4fc9-11e4-839b-080027c4337a for framework 20140904-160348-185204746-5050-27588-0000
>> >> I1009 15:30:02.993341 1414 slave.cpp:1222] Queuing task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' for executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework '20140904-160348-185204746-5050-27588-0000
>> >> I1009 15:30:02.995818 1409 docker.cpp:743] Starting container '25ac3310-71e4-4d10-8a4b-38add4537308' for task 'test-app.23755452-4fc9-11e4-839b-080027c4337a' (and executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a') of framework '20140904-160348-185204746-5050-27588-0000'
>> >>
>> >> I1009 15:31:07.033287 1413 slave.cpp:1278] Asked to kill task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
>> >> I1009 15:31:07.034742 1413 slave.cpp:2088] Handling status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 from @0.0.0.0:0
>> >> W1009 15:31:07.034881 1413 slave.cpp:1354] Killing the unregistered executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a' of framework 20140904-160348-185204746-5050-27588-0000 because it has no tasks
>> >> E1009 15:31:07.034945 1413 slave.cpp:2205] Failed to update resources for container 25ac3310-71e4-4d10-8a4b-38add4537308 of executor test-app.23755452-4fc9-11e4-839b-080027c4337a running task test-app.23755452-4fc9-11e4-839b-080027c4337a on status update for terminal task, destroying container: No container found
>> >> I1009 15:31:07.035133 1413 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
>> >> I1009 15:31:07.035210 1413 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 to [email protected]:5050
>> >> I1009 15:31:07.046167 1408 status_update_manager.cpp:398] Received status update acknowledgement (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000
>> >>
>> >> I1009 15:35:02.993736 1414 slave.cpp:3010] Terminating executor test-app.23755452-4fc9-11e4-839b-080027c4337a of framework 20140904-160348-185204746-5050-27588-0000 because it did not register within 5mins
>> >> ---
>> >>
>> >> I already posted my question on the marathon board, as I first thought it
>> >> was an issue on marathon's end:
>> >> https://groups.google.com/forum/#!topic/marathon-framework/NT7_YIZnNoY
>> >>
>> >> Kind regards,
>> >> Nils

