Chang/Pat and others, I didn't see this in the discussion above, but are you using a single disk or two disks for ZK? One for snapshots and one for transaction logging?
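For reference, splitting them is just two settings in zoo.cfg; a minimal
sketch (the mount points below are only examples):

    # zoo.cfg -- snapshots and the transaction log on separate devices
    dataDir=/disk1/zookeeper/data       # fuzzy snapshots (and myid)
    dataLogDir=/disk2/zookeeper/txnlog  # txn log; fsync latency gates writes

Keeping the transaction log on its own device matters because every write,
including createSession, is synced to the log before the server acknowledges it.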
thanks
mahadev

2011/4/14 Chang Song <tru64...@me.com>:
>
> On Apr 14, 2011, at 1:53 PM, Patrick Hunt wrote:
>
>> two additional thoughts come to mind:
>>
>> 1) try running the ensemble with a single zk server, does this help at
>> all? (it might provide a short-term workaround, and it might also
>> provide some insight into what's causing the issue)
>
> We are going to try this to see if we can identify the culprit.
>
> Thanks.
>
>> 2) can you hold off some of the clients from the stampede? Perhaps add
>> a random holdoff to each of the clients before connecting, and
>> a similar random holdoff before closing the session. This seems like a
>> straightforward change on your client side (easy to implement/try),
>> but it's hard to tell given we don't have much insight into what your
>> use case is.
>>
>> Anyone else in the community have any ideas?
>>
>> Patrick
>>
>> 2011/4/13 Patrick Hunt <ph...@apache.org>:
>>> 2011/4/13 Chang Song <tru64...@me.com>:
>>>>
>>>> Patrick.
>>>> Thank you for the reply.
>>>>
>>>> We are very aware of all the things you mentioned below.
>>>> None of those apply.
>>>>
>>>> Not GC (we monitor every possible resource in the JVM and the
>>>> system). No IO. No swapping. No VM guest OS. No logging.
>>>
>>> Hm, ok, a few more ideas then:
>>>
>>> 1) what is the connectivity like between the servers?
>>>
>>> What is the ping time between them?
>>>
>>> Is the system perhaps loading down the network during this test,
>>> causing network latency to increase? Are all the NICs (server and
>>> client) configured correctly? I've seen a number of cases where
>>> clients and/or servers had incorrectly configured NICs (ethtool
>>> reported 10 Mb/s half duplex for what should be gigabit Ethernet).
>>>
>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while
>>> your issue is happening, what's the %util of the disk? What does the
>>> iowait look like?
>>>
>>> 3) create a JIRA and upload your 3 server configuration files.
>>> Include the log4j.properties file you are using and any other details
>>> you think might be useful. If you can, upload a log file from when
>>> you see the issue; if you can't get one from that window, upload any
>>> log file.
>>>
>>>> Oh, one thing I should mention is that it is not just 1000 clients;
>>>> it is 1000 logins/logouts per second. All operations, such as
>>>> closeSession and ping, take more than 8 seconds (at peak).
>>>
>>> Are you continuously logging in and then logging out, 1000 times per
>>> second? That's not a good use case for ZK sessions in general.
>>> Perhaps if you describe your use case in more detail it would help.
>>>
>>> Patrick
>>>
>>>> It's about CommitProcessor thread queueing (in the leader).
>>>> QueuedRequests goes up to 800, and so does CommittedRequests;
>>>> PendingRequestElapsedTime goes up to 8.8 seconds during this flood.
>>>>
>>>> The easiest way to reproduce this scenario exactly is to
>>>>
>>>> - suspend all client JVMs with a debugger, or
>>>> - make all client JVMs OOME to create heap dumps
>>>>
>>>> in group B. All clients in group A will then be unable to receive a
>>>> ping response within 5 seconds.
>>>>
>>>> We need to fix this as soon as possible.
>>>> What we do as a workaround is to raise the sessionTimeout to 40 sec.
>>>> At least the clients in group A survive. But this increases our
>>>> cluster failover time significantly.
>>>>
>>>> Thank you, Patrick.
>>>>
>>>> ps. We actually push ping requests to the FinalRequestProcessor as
>>>> soon as the packet identifies itself as a ping. No dice.
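To put numbers on the latencies discussed above, each server's built-in
'stat' four-letter-word command reports its min/avg/max request latency,
and 'iostat -x 2' (as Patrick suggests) shows whether the transaction-log
disk is the bottleneck. A quick check might look like the following; the
hostname and client port are illustrative:

    $ echo stat | nc zk1.example.com 2181  # note the "Latency min/avg/max" line
    $ iostat -x 2                          # watch %util and await on the log disk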
>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>
>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>> looked through the troubleshooting guide?
>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>
>>>>> In particular, 1000 clients connecting should be fine; I've
>>>>> personally seen clusters of 7,000-10,000 clients. Keep in mind that
>>>>> each session establishment is essentially a write (so the quorum is
>>>>> involved), and what we typically see there is that the cluster
>>>>> configuration has issues. 14 seconds for a ping response is huge
>>>>> and indicates one of the following may be an underlying cause:
>>>>>
>>>>> 1) are you running in a virtualized environment?
>>>>> 2) are you co-locating other services on the same host(s) that make
>>>>> up the ZK serving cluster?
>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>> In particular, ensure that you are not swapping or going into GC
>>>>> pauses (both on the server and the client):
>>>>> a) turn on GC logging and verify that you are not going into GC
>>>>> pauses (see the troubleshooting guide); this is the most common
>>>>> cause of high latency for the clients
>>>>> b) ensure that you are not swapping
>>>>> c) ensure that other processes are not causing log writing
>>>>> (transactional logging) to be slow.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64...@me.com> wrote:
>>>>>> Hello, folks.
>>>>>>
>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>> Here's a brief scenario.
>>>>>>
>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>> (thus a 5 sec ping interval); let's call these clients group A.
>>>>>>
>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>> same time trying to connect to a three-node ZK ensemble, creating
>>>>>> a createSession stampede.
>>>>>>
>>>>>> Now almost all clients in group A are unable to exchange pings
>>>>>> within the session expiry time (15 sec), so the clients in group A
>>>>>> drop out of the cluster.
>>>>>>
>>>>>> We have looked into this issue a bit and found that it is mostly
>>>>>> due to the synchronous nature of session queue processing. Latency
>>>>>> between ping request and response ranges from 10 ms up to 14
>>>>>> seconds during this login stampede.
>>>>>>
>>>>>> Since session timeout is a serious matter for our cluster, pings
>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>>
>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>> clients and the server, but clients failing to receive ping
>>>>>> responses because of a ZooKeeper login stampede seems like
>>>>>> nonsense to me.
>>>>>>
>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>> Or even multiple ping queues/threads to keep the heartbeat
>>>>>> realtime?
>>>>>>
>>>>>> This is a very serious issue with ZooKeeper for our
>>>>>> mission-critical system. Could anyone look into this?
>>>>>>
>>>>>> I will try to file a bug.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Chang

--
thanks
mahadev
@mahadevkonar
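As an illustration of the random holdoff Patrick suggests above, here is
a minimal sketch assuming the stock org.apache.zookeeper.ZooKeeper Java
client; the connect string, 15-second session timeout, and 30-second
jitter bound are placeholder values, not recommendations:

    import java.util.Random;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class JitteredClient {
        public static void main(String[] args) throws Exception {
            Random random = new Random();

            // Spread session creation over ~30s so 1000 clients do not
            // hit the ensemble with a createSession stampede at once.
            Thread.sleep(random.nextInt(30000));

            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        if (event.getState() == Event.KeeperState.SyncConnected) {
                            connected.countDown();
                        }
                    }
                });
            connected.await();

            // ... application work against the ensemble goes here ...

            // Jitter the shutdown too: closeSession is also a quorum
            // write, so a simultaneous mass exit floods the leader.
            Thread.sleep(random.nextInt(30000));
            zk.close();
        }
    }

Whether 30 seconds of spread is enough depends on how quickly the
ensemble absorbs session creations; it is only a starting point to tune
against the observed PendingRequestElapsedTime.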