zookeeper quorum failing because of high network load

Martin Stiborský Mon, 27 Apr 2015 01:59:22 -0700

Hello guys,
we are running a Mesos stack on CoreOS with three ZooKeeper nodes.


We can start Docker containers with Marathon and all, that's fine, but
some of the containers generate high network load while communicating
between nodes/containers, and I think that's the reason why ZooKeeper
is failing.
From the logs, I can see this error:

Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper server...
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,705 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: EndOfStreamException: Unable to read additional data from client sessionid 0x14cf73508730003, likely client has closed socket
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at java.lang.Thread.run(Thread.java:745)
Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,707 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003

And then all ZK nodes go down, Mesos fails as well, and that's it. The
cluster does eventually recover, but the running tasks are gone without
having finished.

I have to say I don't have proper monitoring in place yet (I'm working on
it right now), so I have no real data to back up this assumption; it is
only my guess.
So if you can confirm that this makes sense, or share your experiences
with me, that would be very valuable to me right now.
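
Until proper monitoring is in place, a small poller along these lines might
help collect some real data during the high-load periods. It is only a
sketch, not a tested setup: it assumes the ZooKeeper four-letter commands
(here "mntr") are enabled and reachable on client port 2181, and the helper
names four_letter_word and parse_mntr are made up for this example.

```python
import socket
import sys


def four_letter_word(host, port=2181, cmd="mntr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        # The server closes the connection once the reply has been sent.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn the 'key<TAB>value' lines of an mntr reply into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip anything that is not a key/value line
            stats[key.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    # Host names are taken from the command line; none are hard-coded here.
    for host in sys.argv[1:]:
        try:
            stats = parse_mntr(four_letter_word(host))
            print(host,
                  stats.get("zk_server_state"),
                  stats.get("zk_avg_latency"),
                  stats.get("zk_outstanding_requests"))
        except OSError as exc:
            print(host, "unreachable:", exc)
```

Run from cron or a loop with the three node addresses as arguments, this
would show whether latency and outstanding requests climb on the nodes
before the quorum is lost, which would support or rule out the network
load theory.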

Thanks a lot!
