Hi, there are 3 ZooKeeper nodes. We've started our containers, and this time I was watching the ZooKeepers and their condition with the "stat" command. It seems that ZooKeeper latency is not the issue; there were only about 8 connections, and the max latency was 134 ms.
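In case it helps, this is roughly how I'm polling the ensemble, a sketch using the four-letter-word commands over `nc` (the zk1/zk2/zk3 hostnames are placeholders for our three nodes):

```shell
# Query each ZooKeeper node with the 'mntr' four-letter word and keep only
# the latency / connection-count / role lines (hostnames are placeholders).
for host in zk1 zk2 zk3; do
  echo "== $host =="
  echo mntr | nc -w 2 "$host" 2181 \
    | grep -E 'zk_(avg|max)_latency|zk_num_alive_connections|zk_server_state'
done
```

`mntr` gives the same numbers as `stat` but in a key/value form that is easier to grep or feed into monitoring.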
I'm still not sure what the real cause is here… From the mesos-master log I see normal behaviour, and then suddenly:

Apr 27 18:02:37 systemd[1]: [email protected]: main process exited, code=exited, status=137/n/a

If we run all our containers on one mesos-slave node, it works, but when they are distributed to three nodes, it fails.

On Mon, Apr 27, 2015 at 11:32 AM Tomas Barton <[email protected]> wrote:

> Hi Martin,
>
> how many ZooKeepers do you have? Is your transaction log on a dedicated
> disk? How many clients are approximately connecting?
>
> have a look at
> http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices
>
> Tomas
>
> On 27 April 2015 at 10:58, Martin Stiborský <[email protected]> wrote:
>
>> Hello guys,
>> we are running a Mesos stack on CoreOS, with three ZooKeeper nodes.
>>
>> We can start docker containers with Marathon and all, that's fine, but
>> some of the docker containers generate high network load while
>> communicating between nodes/containers, and I think that's the reason
>> why ZooKeeper is failing.
>> From the logs, I can see this error:
>>
>> Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper server...
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,705 [myid:1] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: EndOfStreamException: Unable to read additional data from client sessionid 0x14cf73508730003, likely client has closed socket
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:     at java.lang.Thread.run(Thread.java:745)
>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,707 [myid:1] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003
>>
>> And then all ZK nodes go down… Mesos fails as well and that's it. The
>> cluster eventually does recover, but the tasks that were running are
>> gone, not finished.
>>
>> I have to say I don't have proper monitoring in place yet, I'm working
>> on it right now, so I can't rely on real data to prove this assumption,
>> but it's my guess.
>> So if you can confirm that this makes sense, or share your experiences
>> with me, that would be pretty valuable for me right now.
>>
>> Thanks a lot!
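One note on that status=137: exit codes above 128 mean the process died from signal (code − 128), so 137 is SIGKILL (9), which on Linux usually points at the kernel OOM killer or an external `docker kill`/`docker stop` timeout rather than ZooKeeper exiting on its own. The first two lines below just demonstrate the 128+9 arithmetic; the commented commands are the actual checks to run on the failing mesos-slave (the container name is a placeholder):

```shell
# 137 = 128 + 9 (SIGKILL); reproduce the exit status locally:
sh -c 'kill -9 $$'
echo "exit status: $?"   # prints: exit status: 137

# On the failing mesos-slave, check whether the kernel OOM killer fired
# and whether Docker recorded an OOM kill for the container:
# dmesg | grep -i 'killed process'
# docker inspect --format '{{.State.OOMKilled}}' <container>
```

If `OOMKilled` comes back `true`, the fix is a memory limit/allocation question on that slave, not a ZooKeeper latency one.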

