Hi all.

Mesos version = 0.23.0-1.0.ubuntu1404 (mesosphere APT repo)
Marathon version = 0.10.1 (mesosphere APT repo)

Hopefully this is a simple one for someone to answer, though I couldn't find anything immediately obvious in the documentation.

We're trialling Mesos in a cloud (EC2/GCE) environment and the one thing that continues to bite us in the ass is this: tasks fail continually until the Docker image is fully downloaded! Why is this!? Some of our images are small (say 200MB), some much larger (2GB) due to the nature of the software packages we're containerising. Regardless of size, they fail the first dozen (or more) times until one of the slaves has pulled the image.

Why is there an apparent hard time-out and how can I avoid it? I don't want the task to register as a failure when it hasn't even had a chance to run yet! Up until now we've just been tolerating the bouncing around of these tasks, but it's now reached a point where it's darn annoying ;)

I've tried setting executor_registration_timeout to '5mins' but this made no apparent difference (the task is still killed every minute). I should note that these tasks are launched using the Marathon framework, and I've tried setting 'task_launch_timeout' to '3000' and, again, it makes no difference. Based on a brief glance at a Mesos slave log file, it seems the master instructs the slave to kill the task off after 1 minute.

Please advise.

Cheers,
Jim

--
Senior Code Pig
Industrial Light & Magic

