There are a few things going on there: you're having ZooKeeper connectivity issues, and the master is not able to health check the agent.
I would recommend triaging what occurred in your network, but you can increase the master's health check timeouts as a mitigation. You can also control the maximum rate at which the master removes unhealthy agents.

On Sun, Oct 4, 2015 at 1:45 AM, Jeremy Olexa <[email protected]> wrote:

> Hello,
>
> We have been observing some agent processes disconnects when our agent
> processes are in another datacenter, A, and accessing the master cluster in
> datacenter B. I would like to mitigate this issue because it ejects all the
> applications running and then all of the sandbox links, etc, are not
> available because the slave is "lost"
>
> I have attached the disconnect portion of the log here:
>
> https://gist.github.com/jolexa/1a80e26a4b017846d083
>
> I am curious if anyone can offer some advice on making the relevant Mesos
> processes more resilient in this regard. I'm confused on all the timeout
> options and I don't know exactly what to tweak safely.
>
> Thanks for any assistance!
>
> -Jeremy
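To make the mitigation in the reply a bit more concrete, here is a rough sketch of how the master's health check settings combine. It assumes the master flags are `--slave_ping_timeout`, `--max_slave_ping_timeouts`, and `--slave_removal_rate_limit`, with defaults of 15secs and 5 for the first two; flag names and defaults vary by Mesos version, so please confirm against `mesos-master --help` before relying on this. The helper names and the example values (30 seconds, 10 pings, 1/10mins) are purely illustrative, not recommendations.

```python
def removal_window_secs(ping_timeout_secs: int, max_ping_timeouts: int) -> int:
    """Approximate time an agent can stay unresponsive before the master
    considers it failed: one ping timeout per allowed missed ping."""
    return ping_timeout_secs * max_ping_timeouts


def master_flags(ping_timeout_secs, max_ping_timeouts, removal_rate_limit=None):
    """Build the (assumed) mesos-master flags for the given settings."""
    flags = [
        f"--slave_ping_timeout={ping_timeout_secs}secs",
        f"--max_slave_ping_timeouts={max_ping_timeouts}",
    ]
    if removal_rate_limit:
        # Form is (number of agents)/(duration), e.g. "1/10mins".
        flags.append(f"--slave_removal_rate_limit={removal_rate_limit}")
    return flags


if __name__ == "__main__":
    # Assumed defaults: 15 seconds x 5 pings.
    print(removal_window_secs(15, 5))  # 75
    # Illustrative longer window for a cross-datacenter link.
    print(removal_window_secs(30, 10))  # 300
    print(" ".join(master_flags(30, 10, "1/10mins")))
```

A longer window means the master also takes longer to notice agents that are genuinely dead, so it is worth weighing that against how long your cross-datacenter blips usually last.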

