Hi guys,
these machines are relatively beefy - Dell PowerEdge R710 with 2x quad-core Xeon, 144GB RAM, CoreOS is deployed on bare metal.
- ZK is running on the same 3 nodes as the mesos cluster
- our application is not using ZK
- nothing else is running on the stack, only 1 mesos master, 3 mesos slaves and marathon, all of this on top of CoreOS booted via iPXE from the network
- the ZK log is not on a dedicated disk; I could put it on an NFS share (see the sketch below for a quick way to check which device it currently lives on)
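A minimal sketch (the paths are assumptions, not taken from the thread) of one way to confirm whether the ZK transaction log actually sits on a device separate from the snapshot directory and the root filesystem:

#!/usr/bin/env python
# Hedged sketch: compare the devices backing the ZooKeeper dataDir,
# dataLogDir and the root filesystem. The two paths below are hypothetical
# placeholders -- substitute the values from your zoo.cfg.
import os

DATA_DIR = "/var/lib/zookeeper"          # hypothetical dataDir
DATA_LOG_DIR = "/var/lib/zookeeper/log"  # hypothetical dataLogDir

def device_of(path):
    """Return the id of the device the given path resides on."""
    return os.stat(path).st_dev

root_dev = device_of("/")
data_dev = device_of(DATA_DIR)
log_dev = device_of(DATA_LOG_DIR)

print("root fs device:    %s" % root_dev)
print("dataDir device:    %s" % data_dev)
print("dataLogDir device: %s" % log_dev)

if log_dev in (root_dev, data_dev):
    print("transaction log is NOT on a dedicated disk")
else:
    print("transaction log appears to be on its own device")

If the last check reports a shared device, the best-practices page linked later in the thread recommends a dedicated transaction log device for consistent performance.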
The pattern is always the same. We start the first container on the first node, it's a database, then we run the second container with our application on the second cluster node, the application loads data from the database container on the first node, then after about 6 minutes the stack goes down. If we run both containers on the same node, it's fine. That's why I tend to blame the network, but I can't find the problem.

On Tue, Apr 28, 2015 at 7:33 AM Charles Baker <[email protected]> wrote:

> Hi Martin. Are these VMs or bare-metal? Is ZK running on the same 3 nodes
> as the mesos cluster? Does your application also use ZooKeeper to manage
> its own state? Are there any other services running on the machines, and
> do Mesos and ZK have enough resources? And as Tomas asked: is your ZK log
> on a dedicated disk?
>
> On Mon, Apr 27, 2015 at 11:20 AM Martin Stiborský <[email protected]> wrote:
>
>> Hi,
>> there are 3 zookeeper nodes.
>> We've started our containers and this time I was watching the zookeepers
>> and their condition with the "stat" command.
>> It seems that zookeeper latency is not the issue, there were only about 8
>> connections, max latency 134ms.
>>
>> I'm still not sure what the real cause is here…from the mesos-master log I
>> see normal behaviour and then suddenly:
>> Apr 27 18:02:37 systemd[1]: [email protected]: main process exited,
>> code=exited, status=137/n/a
>>
>> If we run our containers all on one mesos-slave node, it works, but when
>> they are distributed to three nodes, it's failing.
>>
>> On Mon, Apr 27, 2015 at 11:32 AM Tomas Barton <[email protected]> wrote:
>>
>>> Hi Martin,
>>>
>>> how many ZooKeepers do you have? Is your transaction log on a dedicated
>>> disk? How many clients are approximately connecting?
>>>
>>> Have a look at
>>> http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices
>>>
>>> Tomas
>>>
>>> On 27 April 2015 at 10:58, Martin Stiborský <[email protected]> wrote:
>>>
>>>> Hello guys,
>>>> we are running a mesos stack on CoreOS, with three zookeeper nodes.
>>>>
>>>> We can start docker containers with Marathon and all, that's fine,
>>>> but some of the docker containers generate high network load while
>>>> communicating between nodes/containers, and I think that's the reason
>>>> why the zookeeper is failing.
>>>> From the logs, I can see this error:
>>>>
>>>> Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper
>>>> server...
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27
>>>> 05:06:45,705 [myid:1] - WARN [NIOServerCxn.Factory:
>>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream
>>>> exception
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:
>>>> EndOfStreamException: Unable to read additional data from client sessionid
>>>> 0x14cf73508730003, likely client has closed socket
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>>> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>>> java.lang.Thread.run(Thread.java:745)
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27
>>>> 05:06:45,707 [myid:1] - INFO [NIOServerCxn.Factory:
>>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection
>>>> for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003
>>>>
>>>> And then all ZK nodes go down…mesos fails as well and that's it. The
>>>> cluster eventually does recover, but the tasks that were running are
>>>> gone, not finished.
>>>>
>>>> I have to say I don't have proper monitoring in place yet, I'm working
>>>> on it right now, so I can't rely on real data to prove this assumption,
>>>> but it's my guess.
>>>> So if you can confirm that this makes sense, or share your experiences
>>>> with me, that would be pretty valuable for me right now.
>>>>
>>>> Thanks a lot!
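Since the thread mentions watching the ensemble with the four-letter "stat" command and notes that proper monitoring is not yet in place, here is a minimal sketch of polling each ZooKeeper node for its "stat" output over the client port. The hostnames are placeholders, not the real nodes from the thread:

#!/usr/bin/env python
# Hedged sketch: send the ZooKeeper four-letter "stat" command to each
# ensemble member and print the reply (connections, latency, mode, ...).
# The hostnames below are assumptions -- replace them with the real nodes.
import socket

ZK_NODES = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
ZK_PORT = 2181

def four_letter_word(host, port, cmd="stat", timeout=5.0):
    """Send a four-letter command and return the server's raw response."""
    sock = socket.create_connection((host, port), timeout)
    try:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8", "replace")
    finally:
        sock.close()

if __name__ == "__main__":
    for node in ZK_NODES:
        try:
            print("=== %s:%d ===" % (node, ZK_PORT))
            print(four_letter_word(node, ZK_PORT))
        except socket.error as exc:
            print("%s unreachable: %s" % (node, exc))

Running something like this every few seconds while the two containers are split across nodes should show whether connection counts or latency spike right before the ensemble drops.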

