Hi guys,
these machines are relatively beefy - Dell PowerEdge R710 with 2x quad-core Xeon, 144GB RAM, CoreOS is deployed on bare metal.
- ZK is running on the same 3 nodes as the mesos cluster
- our application is not using ZK
- nothing else is running on the stack, only 1 mesos master, 3 mesos slaves and marathon, all of this on top of CoreOS booted via iPXE from the network
- the ZK log is not on a dedicated disk; I could put it on an NFS share (see the sketch below for a quick way to check which device it currently lives on)
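A minimal sketch (the paths are assumptions, not taken from the thread) of one way to confirm whether the ZK transaction log actually sits on a device separate from the snapshot directory and the root filesystem:

#!/usr/bin/env python
# Hedged sketch: compare the devices backing the ZooKeeper dataDir,
# dataLogDir and the root filesystem. The two paths below are hypothetical
# placeholders -- substitute the values from your zoo.cfg.
import os

DATA_DIR = "/var/lib/zookeeper"          # hypothetical dataDir
DATA_LOG_DIR = "/var/lib/zookeeper/log"  # hypothetical dataLogDir

def device_of(path):
    """Return the id of the device the given path resides on."""
    return os.stat(path).st_dev

root_dev = device_of("/")
data_dev = device_of(DATA_DIR)
log_dev = device_of(DATA_LOG_DIR)

print("root fs device:    %s" % root_dev)
print("dataDir device:    %s" % data_dev)
print("dataLogDir device: %s" % log_dev)

if log_dev in (root_dev, data_dev):
    print("transaction log is NOT on a dedicated disk")
else:
    print("transaction log appears to be on its own device")

If the last check reports a shared device, the best-practices page linked later in the thread recommends a dedicated transaction log device for consistent performance.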
The pattern is always the same. We start the first container on the first node, it's a database, then we run the second container with our application on the second cluster node, the application loads data from the database container on the first node, then after about 6 minutes the stack goes down. If we run both containers on the same node, it's fine. That's why I tend to blame the network, but I can't find the problem.

On Tue, Apr 28, 2015 at 7:33 AM Charles Baker <[email protected]> wrote:

> Hi Martin. Are these VMs or bare-metal? Is ZK running on the same 3 nodes
> as the mesos cluster? Does your application also use ZooKeeper to manage
> its own state? Are there any other services running on the machines, and
> do Mesos and ZK have enough resources? And as Tomas asked: is your ZK log
> on a dedicated disk?
>
> On Mon, Apr 27, 2015 at 11:20 AM Martin Stiborský <[email protected]> wrote:
>
>> Hi,
>> there are 3 zookeeper nodes.
>> We've started our containers and this time I was watching the zookeepers
>> and their condition with the "stat" command.
>> It seems that zookeeper latency is not the issue, there were only about 8
>> connections, max latency 134ms.
>>
>> I'm still not sure what the real cause is here…from the mesos-master log I
>> see normal behaviour and then suddenly:
>> Apr 27 18:02:37 systemd[1]: [email protected]: main process exited,
>> code=exited, status=137/n/a
>>
>> If we run our containers all on one mesos-slave node, it works, but when
>> they are distributed to three nodes, it's failing.
>>
>> On Mon, Apr 27, 2015 at 11:32 AM Tomas Barton <[email protected]> wrote:
>>
>>> Hi Martin,
>>>
>>> how many ZooKeepers do you have? Is your transaction log on a dedicated
>>> disk? How many clients are approximately connecting?
>>>
>>> Have a look at
>>> http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices
>>>
>>> Tomas
>>>
>>> On 27 April 2015 at 10:58, Martin Stiborský <[email protected]> wrote:
>>>
>>>> Hello guys,
>>>> we are running a mesos stack on CoreOS, with three zookeeper nodes.
>>>>
>>>> We can start docker containers with Marathon and all, that's fine,
>>>> but some of the docker containers generate high network load while
>>>> communicating between nodes/containers, and I think that's the reason
>>>> why the zookeeper is failing.
>>>> From the logs, I can see this error:
>>>>
>>>> Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper
>>>> server...
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27
>>>> 05:06:45,705 [myid:1] - WARN [NIOServerCxn.Factory:
>>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream
>>>> exception
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]:
>>>> EndOfStreamException: Unable to read additional data from client sessionid
>>>> 0x14cf73508730003, likely client has closed socket
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>>> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at
>>>> java.lang.Thread.run(Thread.java:745)
>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27
>>>> 05:06:45,707 [myid:1] - INFO [NIOServerCxn.Factory:
>>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection
>>>> for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003
>>>>
>>>> And then all ZK nodes go down…mesos fails as well and that's it. The
>>>> cluster eventually does recover, but the tasks that were running are
>>>> gone, not finished.
>>>>
>>>> I have to say I don't have proper monitoring in place yet, I'm working
>>>> on it right now, so I can't rely on real data to prove this assumption,
>>>> but it's my guess.
>>>> So if you can confirm that this makes sense, or share your experiences
>>>> with me, that would be pretty valuable for me right now.
>>>>
>>>> Thanks a lot!
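Since the thread mentions watching the ensemble with the four-letter "stat" command and notes that proper monitoring is not yet in place, here is a minimal sketch of polling each ZooKeeper node for its "stat" output over the client port. The hostnames are placeholders, not the real nodes from the thread:

#!/usr/bin/env python
# Hedged sketch: send the ZooKeeper four-letter "stat" command to each
# ensemble member and print the reply (connections, latency, mode, ...).
# The hostnames below are assumptions -- replace them with the real nodes.
import socket

ZK_NODES = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
ZK_PORT = 2181

def four_letter_word(host, port, cmd="stat", timeout=5.0):
    """Send a four-letter command and return the server's raw response."""
    sock = socket.create_connection((host, port), timeout)
    try:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8", "replace")
    finally:
        sock.close()

if __name__ == "__main__":
    for node in ZK_NODES:
        try:
            print("=== %s:%d ===" % (node, ZK_PORT))
            print(four_letter_word(node, ZK_PORT))
        except socket.error as exc:
            print("%s unreachable: %s" % (node, exc))

Running something like this every few seconds while the two containers are split across nodes should show whether connection counts or latency spike right before the ensemble drops.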

