I only use one filesystem for all logs.
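For reference, the one-filesystem setup above is exactly what Mahadev's question below probes: ZooKeeper can split snapshots and the transaction log across separate devices via the dataDir and dataLogDir settings in zoo.cfg, so txn-log fsyncs don't compete with snapshot writes. A hypothetical fragment (the paths and values are illustrative, not from this thread):

```
# zoo.cfg sketch: snapshots and txn log on separate devices
# (paths are made up for illustration)
tickTime=2000
dataDir=/disk1/zookeeper/data        # fuzzy snapshots
dataLogDir=/disk2/zookeeper/txnlog   # transaction log, fsync-heavy
clientPort=2181
```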
On Apr 15, 2011, at 1:00 AM, Mahadev Konar wrote:

> Chang/Pat and others,
> I didn't see this in the discussions above, but are you guys using a
> single disk or two disks for ZK? One for the snapshot and one for txn
> logging?
>
> thanks
> mahadev
>
> 2011/4/14 Chang Song <tru64...@me.com>:
>>
>> On Apr 14, 2011, at 1:53 PM, Patrick Hunt wrote:
>>
>>> Two additional thoughts come to mind:
>>>
>>> 1) Try running the ensemble with a single ZK server. Does this help at
>>> all? (It might provide a short-term workaround; it might also provide
>>> some insight into what's causing the issue.)
>>
>> We are going to try this to see if we can identify the culprit.
>>
>> Thanks.
>>
>>> 2) Can you hold off some of the clients from the stampede? Perhaps add
>>> a random holdoff to each of the clients before connecting, and
>>> additionally a similar random holdoff before closing the session. This
>>> seems like a straightforward change on your client side (easy to
>>> implement/try), but it's hard to tell given we don't have much insight
>>> into what your use case is.
>>>
>>> Anyone else in the community have any ideas?
>>>
>>> Patrick
>>>
>>> 2011/4/13 Patrick Hunt <ph...@apache.org>:
>>>> 2011/4/13 Chang Song <tru64...@me.com>:
>>>>>
>>>>> Patrick,
>>>>> thank you for the reply.
>>>>>
>>>>> We are well aware of all the things you mentioned below.
>>>>> It is none of those.
>>>>>
>>>>> Not GC (we monitor every possible resource in the JVM and the system).
>>>>> No IO. No swapping.
>>>>> No VM guest OS. No logging.
>>>>
>>>> Hm, ok. A few more ideas then:
>>>>
>>>> 1) What is the connectivity like between the servers?
>>>> What is the ping time between them?
>>>>
>>>> Is the system perhaps loading down the network during this test,
>>>> causing network latency to increase? Are all the NIC cards (server and
>>>> client) configured correctly?
>>>> I've seen a number of cases where
>>>> clients and/or servers had incorrectly configured NICs (ethtool
>>>> reported 10 Mb/sec half duplex for what should be gigabit ethernet).
>>>>
>>>> 2) Regarding IO: if you run 'iostat -x 2' on the ZK servers while the
>>>> issue is happening, what's the %util of the disk? What does the
>>>> iowait look like?
>>>>
>>>> 3) Create a JIRA and upload your three server configuration files.
>>>> Include the log4j.properties file you are using and any other details
>>>> you think might be useful. If you can upload a log file from when you
>>>> see this issue, that would be useful; upload any log file if you
>>>> can't get one from the time when you see the issue.
>>>>
>>>>> Oh, one thing I should mention is that it is not 1000 clients doing
>>>>> 1000 logins/logouts per second. All operations, like closeSession and
>>>>> ping, take more than 8 seconds (at peak).
>>>>
>>>> Are you continuously logging in and then logging out, 1000 times per
>>>> second? That's not a good use case for ZK sessions in general. Perhaps
>>>> if you describe your use case in more detail it would help.
>>>>
>>>> Patrick
>>>>
>>>>> It's about CommitProcessor thread queueing (in the leader).
>>>>> QueuedRequests goes up to 800, and so do CommittedRequests and
>>>>> PendingRequestElapsedTime. PendingRequestElapsedTime
>>>>> goes up to 8.8 seconds during this flood.
>>>>>
>>>>> The easiest way to reproduce this scenario exactly is to
>>>>>
>>>>> - suspend all client JVMs with a debugger, or
>>>>> - cause all client JVMs to OOME to create heap dumps
>>>>>
>>>>> in group B. All clients in group A will then fail to receive a
>>>>> ping response within 5 seconds.
>>>>>
>>>>> We need to fix this as soon as possible.
>>>>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>>>>> At least the clients in group A survive, but this increases
>>>>> our cluster failover time significantly.
>>>>>
>>>>> Thank you, Patrick.
>>>>>
>>>>> ps.
>>>>> We actually push ping requests to the FinalRequestProcessor as soon
>>>>> as a packet identifies itself as a ping. No dice.
>>>>>
>>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>>
>>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>>> looked through the troubleshooting guide?
>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>>
>>>>>> In particular, 1000 clients connecting should be fine; I've
>>>>>> personally seen clusters of 7-10 thousand clients. Keep in mind that
>>>>>> each session establishment is essentially a write (so the quorum is
>>>>>> involved), and what we typically see there is that the cluster
>>>>>> configuration has issues. 14 seconds for a ping response is huge and
>>>>>> indicates that one of the following may be an underlying cause:
>>>>>>
>>>>>> 1) Are you running in a virtualized environment?
>>>>>> 2) Are you co-locating other services on the same host(s) that make
>>>>>> up the ZK serving cluster?
>>>>>> 3) Have you followed the admin guide's "things to avoid"?
>>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>>> In particular, ensure that you are not swapping or going into GC
>>>>>> pause (both on the server and the client):
>>>>>> a) Try turning on GC logging and ensure that you are not going into
>>>>>> GC pause; see the troubleshooting guide. This is the most common
>>>>>> cause of high latency for clients.
>>>>>> b) Ensure that you are not swapping.
>>>>>> c) Ensure that other processes are not causing log writing
>>>>>> (transactional logging) to be slow.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64...@me.com> wrote:
>>>>>>> Hello, folks.
>>>>>>>
>>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>>> Here's a brief scenario.
>>>>>>>
>>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>>> (thus a 5 sec ping interval); let's call these clients group A.
>>>>>>>
>>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>>> same time trying to connect to a three-node ZK ensemble, creating
>>>>>>> a ZK createSession stampede.
>>>>>>>
>>>>>>> Now almost no client in group A is able to exchange a ping within
>>>>>>> the session expiry time (15 sec),
>>>>>>> so the clients in group A drop out of the cluster.
>>>>>>>
>>>>>>> We have looked into this issue a bit and found it is mostly due to
>>>>>>> the synchronous nature of session queue processing.
>>>>>>> Latency between ping request and response ranges from 10 ms up to
>>>>>>> 14 seconds during this login stampede.
>>>>>>>
>>>>>>> Since session timeout is a serious matter for our cluster, ping
>>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>>>
>>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>>> clients and the server, but clients failing to receive ping
>>>>>>> responses because of ZooKeeper login sessions seems like nonsense
>>>>>>> to me.
>>>>>>>
>>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>>> Or even multiple ping queues/threads to keep heartbeats realtime?
>>>>>>>
>>>>>>> This is a very serious issue with ZooKeeper for our
>>>>>>> mission-critical system. Could anyone look into this?
>>>>>>>
>>>>>>> I will try to file a bug.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Chang
>
> --
> thanks
> mahadev
> @mahadevkonar
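Patrick's random-holdoff suggestion earlier in the thread can be sketched as below. This is a minimal illustration under assumptions, not ZooKeeper client code: connect stands in for whatever call the application uses to open a session, and the 5-second window is an arbitrary choice.

```python
import random
import time

def jittered_holdoff(window_s=5.0):
    """Uniform random delay in [0, window_s) for this client."""
    return random.random() * window_s

def start_client(connect, window_s=5.0):
    # Each client sleeps its own random slice of the window before
    # opening a session, so a fleet of clients no longer hits the
    # quorum (session creation is a quorum write) in the same instant.
    time.sleep(jittered_holdoff(window_s))
    return connect()
```

The same holdoff would be applied before closing sessions. Spreading 1000 session creations over a 5 s window means the quorum sees roughly 200 session writes per second instead of a single burst of 1000.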