Hi Martin. Are these VMs or bare metal? Is ZK running on the same 3 nodes
as the Mesos cluster? Does your application also use ZooKeeper to manage
its own state? Are there any other services running on the machines, and
do Mesos and ZK have enough resources? And as Tomas asked: is your ZK
transaction log on a dedicated disk?
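
For what it's worth, putting the transaction log on its own device is just
two properties in zoo.cfg; a minimal sketch (the paths are only examples):

  # zoo.cfg -- keep the transaction log off the snapshot/OS disk
  dataDir=/var/lib/zookeeper    # snapshots
  dataLogDir=/mnt/zk-txnlog     # transaction log, on a dedicated disk

ZK fsyncs the transaction log on every write, so sharing that disk with
anything busy (Docker images, Mesos work dirs) can stall the quorum.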


On Mon, Apr 27, 2015 at 11:20 AM Martin Stiborský <
[email protected]> wrote:

> Hi,
> there are 3 ZooKeeper nodes.
> We've started our containers and this time I was watching the ZooKeeper
> nodes and their condition with the "stat" command.
> It seems that ZooKeeper latency is not the issue; there were only about 8
> connections, with a max latency of 134 ms.
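>
> For completeness, this is how I queried it, with the four-letter words on
> the client port (hostname is just a placeholder):
>
>   echo stat | nc zk1.example.com 2181   # connections + latency summary
>   echo mntr | nc zk1.example.com 2181   # same numbers as raw metrics (ZK 3.4+)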
>
> I'm still not sure what the real cause is here… from the mesos-master log
> I see normal behaviour and then suddenly:
> Apr 27 18:02:37 systemd[1]: [email protected]: main process exited,
> code=exited, status=137/n/a
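>
> (Status 137 is 128 + 9, i.e. the process was killed with SIGKILL, so
> something killed it from outside; the kernel OOM killer would be my first
> suspect. One way to check on the affected node:
>
>   journalctl -k | grep -i oom
>
> which would show any OOM-killer activity in the kernel log.)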
>
> If we run all our containers on one mesos-slave node, it works, but when
> they are distributed across three nodes, it fails.
>
>
> On Mon, Apr 27, 2015 at 11:32 AM Tomas Barton <[email protected]>
> wrote:
>
>> Hi Martin,
>>
>> How many ZooKeeper nodes do you have? Is your transaction log on a
>> dedicated disk? How many clients are approximately connecting?
>>
>> Have a look at
>> http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices
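>>
>> To get a rough client count you can list the open connections on each
>> server (host is an example):
>>
>>   echo cons | nc zk1.example.com 2181 | wc -l
>>
>> 'cons' prints roughly one line per connected client.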
>>
>> Tomas
>>
>> On 27 April 2015 at 10:58, Martin Stiborský <[email protected]>
>> wrote:
>>
>>> Hello guys,
>>> we are running a Mesos stack on CoreOS, with three ZooKeeper nodes.
>>>
>>> We can start Docker containers with Marathon and all that's fine, but
>>> some of the containers generate high network load while communicating
>>> between nodes/containers, and I think that's the reason why ZooKeeper
>>> is failing.
>>> From logs, I can see this error:
>>>
>>> Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper
>>> server...
>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27
>>> 05:06:45,705 [myid:1] - WARN  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream
>>> exception
>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:
>>> EndOfStreamException: Unable to read additional data from client sessionid
>>> 0x14cf73508730003, likely client has closed socket
>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>> java.lang.Thread.run(Thread.java:745)
>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27
>>> 05:06:45,707 [myid:1] - INFO  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection
>>> for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003
>>>
>>> And then all ZK nodes go down… Mesos fails as well and that's it. The
>>> cluster eventually does recover, but the running tasks are gone, not
>>> finished.
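>>>
>>> One thing I was wondering: whether the quorum timing settings in
>>> zoo.cfg need raising when the network is saturated. The stock values
>>> from the sample config are:
>>>
>>>   tickTime=2000   # ms; session timeouts are multiples of this
>>>   initLimit=10    # ticks a follower may take to connect and sync
>>>   syncLimit=5     # ticks a follower may lag before being dropped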
>>>
>>> I have to say I don't have proper monitoring in place yet (working on
>>> it right now), so I can't rely on real data to prove this assumption,
>>> but it's my guess.
>>> So if you can confirm that this makes sense, or share your experiences
>>> with me, that would be pretty valuable for me right now.
>>>
>>> Thanks a lot!
>>>
>>
>>
