Two additional thoughts come to mind:

1) Try running the ensemble with a single zk server. Does this help at
all? (It might provide a short-term workaround, and it might also provide
some insight into what's causing the issue.)

2) Can you hold off some of the clients from the stampede? Perhaps add a
random holdoff to each of the clients before connecting, and a similar
random holdoff before closing the session. This seems like a
straightforward change on your client side (easy to implement/try), but
it's hard to tell given we don't have much insight into what your use
case is.
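
For example, something along these lines on the client side (just a
rough, untested sketch assuming you're using the standard Java client;
the connect string, session timeout, and jitter bound below are made-up
placeholders you'd tune to your fleet size):

    import java.util.Random;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class JitteredClient {

        private static final Random RAND = new Random();

        // Placeholder values -- tune these for your environment/fleet size.
        private static final String CONNECT_STRING = "zk1:2181,zk2:2181,zk3:2181";
        private static final int SESSION_TIMEOUT_MS = 15000;
        private static final int MAX_JITTER_MS = 30000; // spread ~1000 clients over ~30 s

        public static ZooKeeper connectWithJitter() throws Exception {
            // Random holdoff before connecting, so all the clients don't try
            // to open sessions (a quorum write each) in the same instant.
            Thread.sleep(RAND.nextInt(MAX_JITTER_MS));

            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(CONNECT_STRING, SESSION_TIMEOUT_MS,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            if (event.getState() == Event.KeeperState.SyncConnected) {
                                connected.countDown();
                            }
                        }
                    });
            connected.await();
            return zk;
        }

        public static void closeWithJitter(ZooKeeper zk) throws Exception {
            // Similar random holdoff before closing, since closeSession is
            // also a quorum write and a mass shutdown causes the same flood.
            Thread.sleep(RAND.nextInt(MAX_JITTER_MS));
            zk.close();
        }
    }

Even jittering only the connect side should tell you something -- the idea
is just to spread what is currently a single burst of session-creation
writes out over a window the leader's commit pipeline can absorb.
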
Anyone else in the community have any ideas?

Patrick

2011/4/13 Patrick Hunt <[email protected]>:
> 2011/4/13 Chang Song <[email protected]>:
>>
>> Patrick.
>> Thank you for the reply.
>>
>> We are very aware of all the things you mentioned below.
>> None of those.
>>
>> Not GC (we monitor every possible resource in the JVM and the system).
>> No IO. No swapping.
>> No VM guest OS. No logging.
>>
>
> Hm, ok, a few more ideas then:
>
> 1) What is the connectivity like between the servers? What is the ping
> time between them? Is the system perhaps loading down the network
> during this test, causing network latency to increase? Are all the NICs
> (server and client) configured correctly? I've seen a number of cases
> where clients and/or servers had incorrectly configured NICs (ethtool
> reported 10Mb/sec half duplex for what should be 1GigE).
>
> 2) Regarding IO: if you run 'iostat -x 2' on the zk servers while your
> issue is happening, what's the %util of the disk? What does the iowait
> look like?
>
> 3) Create a JIRA and upload your 3 server configuration files. Include
> the log4j.properties file you are using and any other details you
> think might be useful. If you can upload a log file from when you see
> this issue, that would be useful; upload any log file if you can't get
> one from the time when you see the issue.
>
>>
>> Oh, one thing I should mention is that it is not just 1000 clients;
>> it is 1000 logins/logouts per second. All operations, like
>> closeSession and ping, take more than 8 seconds (peak).
>>
>
> Are you continuously logging in and then logging out, 1000 times per
> second? That's not a good use case for ZK sessions in general. Perhaps
> if you describe your use case in more detail it would help.
>
> Patrick
>
>> It's about CommitProcessor thread queueing (in the leader).
>> QueuedRequests goes up to 800, and so do CommittedRequests and
>> PendingRequestElapsedTime. PendingRequestElapsedTime goes up to
>> 8.8 seconds during this flood.
>>
>> To reproduce this scenario exactly, the easiest way is to
>>
>> - suspend all client JVMs with a debugger
>> - cause an OOME in all client JVMs to create heap dumps
>>
>> in group B. All clients in group A will then not be able to receive a
>> ping response within 5 seconds.
>>
>> We need to fix this as soon as possible.
>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>> At least the clients in group A survive, but this increases our
>> cluster failover time significantly.
>>
>> Thank you, Patrick.
>>
>>
>> ps. We actually push ping requests to the FinalRequestProcessor as
>> soon as the packet identifies itself as a ping. No dice.
>>
>>
>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>
>>> Hi Chang, it sounds like you may have an issue with your cluster
>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>> looked through the troubleshooting guide?
>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>
>>> In particular, 1000 clients connecting should be fine; I've
>>> personally seen clusters of 7-10 thousand clients. Keep in mind that
>>> each session establishment is essentially a write (so the quorum is
>>> involved), and what we typically see there is that the cluster
>>> configuration has issues. 14 seconds for a ping response is huge and
>>> indicates one of the following may be an underlying cause:
>>>
>>> 1) Are you running in a virtualized environment?
>>> 2) Are you co-locating other services on the same host(s) that make
>>> up the ZK serving cluster?
>>> 3) Have you followed the admin guide's "things to avoid"?
>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>> In particular, ensure that you are not swapping or going into GC
>>> pause (both on the server and the client):
>>> a) Try turning on GC logging and ensure that you are not going into
>>> GC pause (see the troubleshooting guide); this is the most common
>>> cause of high latency for the clients.
>>> b) Ensure that you are not swapping.
>>> c) Ensure that other processes are not causing log writing
>>> (transactional logging) to be slow.
>>>
>>> Patrick
>>>
>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <[email protected]> wrote:
>>>> Hello, folks.
>>>>
>>>> We have run into a very serious issue with ZooKeeper.
>>>> Here's a brief scenario.
>>>>
>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>> (thus a 5 sec ping interval); let's call these clients group A.
>>>>
>>>> Now 1000 new clients (let's call these group B) start up at the
>>>> same time trying to connect to a three-node ZK ensemble, creating
>>>> a ZK createSession stampede.
>>>>
>>>> Now almost all clients in group A are unable to exchange a ping
>>>> within the session expiration time (15 sec), so the clients in
>>>> group A drop out of the cluster.
>>>>
>>>> We have looked into this issue a bit and found that it is mostly
>>>> due to the synchronous nature of session queue processing. Latency
>>>> between a ping request and its response ranges from 10ms up to 14
>>>> seconds during this login stampede.
>>>>
>>>> Since the session timeout is a serious matter for our cluster,
>>>> pings should be handled in a pseudo-realtime fashion.
>>>>
>>>> I don't know exactly how the ping timeout policy works in the
>>>> clients and the server, but clients failing to receive ping
>>>> responses because of a ZooKeeper login stampede makes no sense
>>>> to me.
>>>>
>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>> Or even multiple ping queues/threads to keep the heartbeat
>>>> realtime?
>>>>
>>>> This is a very serious issue with ZooKeeper for our
>>>> mission-critical system. Could anyone look into this?
>>>>
>>>> I will try to file a bug.
>>>>
>>>> Thank you.
>>>>
>>>> Chang
>>>>
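
For reference on the 40 sec sessionTimeout workaround mentioned above: in
the standard Java client the session timeout is requested in milliseconds
when the handle is created, and the server clamps the request to its
configured min/max session timeout bounds, so it is worth reading back
the negotiated value. A minimal sketch (the connect string is a
placeholder):

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class RaisedSessionTimeout {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);

            // Ask for a 40 s session timeout; the constructor takes
            // milliseconds.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 40000,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            if (event.getState() == Event.KeeperState.SyncConnected) {
                                connected.countDown();
                            }
                        }
                    });
            connected.await();

            // The server may not grant exactly what was requested, so check
            // the negotiated value rather than assuming 40000 ms.
            System.out.println("negotiated session timeout = "
                    + zk.getSessionTimeout() + " ms");
            zk.close();
        }
    }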
