Hello guys, we are running a Mesos stack on CoreOS, with three ZooKeeper nodes.
We can start Docker containers with Marathon and that all works fine, but some of the containers generate high network load while communicating between nodes/containers, and I think that's the reason why ZooKeeper is failing.

From the logs, I can see this error:

Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper server...
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,705 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: EndOfStreamException: Unable to read additional data from client sessionid 0x14cf73508730003, likely client has closed socket
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at java.lang.Thread.run(Thread.java:745)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,707 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003

And then all ZK nodes go down, Mesos fails as well, and that's it. The cluster eventually does recover, but the running tasks are gone, not finished.

I have to say I don't have proper monitoring in place yet (I'm working on it right now), so I can't rely on real data to prove this assumption; it's only my guess. So if you can confirm that this makes sense, or share your experiences, that would be very valuable for me right now.

Thanks a lot!

