Hi, there are 3 ZooKeeper nodes. We've started our containers, and this time I was watching the ZooKeepers and their condition with the "stat" command. It seems that ZooKeeper latency is not the issue; there were only about 8 connections, and the max latency was 134 ms.
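In case it helps, this is roughly how I'm polling the ensemble, a sketch using the four-letter-word commands over `nc` (the zk1/zk2/zk3 hostnames are placeholders for our three nodes):

```shell
# Query each ZooKeeper node with the 'mntr' four-letter word and keep only
# the latency / connection-count / role lines (hostnames are placeholders).
for host in zk1 zk2 zk3; do
  echo "== $host =="
  echo mntr | nc -w 2 "$host" 2181 \
    | grep -E 'zk_(avg|max)_latency|zk_num_alive_connections|zk_server_state'
done
```

`mntr` gives the same numbers as `stat` but in a key/value form that is easier to grep or feed into monitoring.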
I'm still not sure what the real cause is here… From the mesos-master log I see normal behaviour, and then suddenly:

Apr 27 18:02:37 systemd[1]: [email protected]: main process exited, code=exited, status=137/n/a

If we run all our containers on one mesos-slave node, it works, but when they are distributed to three nodes, it fails.

On Mon, Apr 27, 2015 at 11:32 AM Tomas Barton <[email protected]> wrote:

> Hi Martin,
>
> how many ZooKeepers do you have? Is your transaction log on a dedicated
> disk? How many clients are approximately connecting?
>
> have a look at
> http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices
>
> Tomas
>
> On 27 April 2015 at 10:58, Martin Stiborský <[email protected]> wrote:
>
>> Hello guys,
>> we are running a Mesos stack on CoreOS, with three ZooKeeper nodes.
>>
>> We can start docker containers with Marathon and all, that's fine, but
>> some of the docker containers generate high network load while
>> communicating between nodes/containers, and I think that's the reason
>> why ZooKeeper is failing.
>> From the logs, I can see this error:
>>
>> Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper server...
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,705 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: EndOfStreamException: Unable to read additional data from client sessionid 0x14cf73508730003, likely client has closed socket
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:     at java.lang.Thread.run(Thread.java:745)
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,707 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003
>>
>> And then all ZK nodes go down… Mesos fails as well and that's it. The
>> cluster eventually does recover, but the tasks that were running are
>> gone, not finished.
>>
>> I have to say I don't have proper monitoring in place yet, I'm working
>> on it right now, so I can't rely on real data to prove this assumption,
>> but it's my guess.
>> So if you can confirm that this makes sense, or share your experiences
>> with me, that would be pretty valuable for me right now.
>>
>> Thanks a lot!
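One note on that status=137: exit codes above 128 mean the process died from signal (code − 128), so 137 is SIGKILL (9), which on Linux usually points at the kernel OOM killer or an external `docker kill`/`docker stop` timeout rather than ZooKeeper exiting on its own. The first two lines below just demonstrate the 128+9 arithmetic; the commented commands are the actual checks to run on the failing mesos-slave (the container name is a placeholder):

```shell
# 137 = 128 + 9 (SIGKILL); reproduce the exit status locally:
sh -c 'kill -9 $$'
echo "exit status: $?"   # prints: exit status: 137

# On the failing mesos-slave, check whether the kernel OOM killer fired
# and whether Docker recorded an OOM kill for the container:
# dmesg | grep -i 'killed process'
# docker inspect --format '{{.State.OOMKilled}}' <container>
```

If `OOMKilled` comes back `true`, the fix is a memory limit/allocation question on that slave, not a ZooKeeper latency one.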

