On Apr 14, 2011, at 1:53 PM, Patrick Hunt wrote:

> Two additional thoughts come to mind:
>
> 1) Try running the ensemble with a single zk server. Does this help at
> all? (It might provide a short-term workaround; it also might provide
> some insight into what's causing the issue.)

We are going to try this to see if we can identify a culprit. Thanks.

> 2) Can you hold off some of the clients from the stampede? Perhaps add
> a random holdoff to each of the clients before connecting, and
> additionally a similar random holdoff before closing the session. This
> seems like a straightforward change on your client side (easy to
> implement/try), but it's hard to tell given that we don't have much
> insight into what your use case is.
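This also looks easy to try on our side. A minimal sketch of the kind
of jitter we could add, assuming a plain Java client (the connect
string, the 10-second holdoff bound, and the no-op watcher are
illustrative, not our actual client code):

    import java.util.Random;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class JitteredClient {
        private static final Random RAND = new Random();

        // Sleep for a random 0..maxMillis before an action, so that 1000
        // clients spread out their connects/closes instead of stampeding.
        static void randomHoldoff(int maxMillis) throws InterruptedException {
            Thread.sleep(RAND.nextInt(maxMillis));
        }

        public static void main(String[] args) throws Exception {
            randomHoldoff(10000);  // jitter before connecting
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                    new Watcher() {
                        public void process(WatchedEvent event) { /* ignore */ }
                    });
            // ... normal work against the ensemble ...
            randomHoldoff(10000);  // jitter before closing the session
            zk.close();
        }
    }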
> Anyone else in the community have any ideas?
>
> Patrick
>
> 2011/4/13 Patrick Hunt <ph...@apache.org>:
>> 2011/4/13 Chang Song <tru64...@me.com>:
>>>
>>> Patrick.
>>> Thank you for the reply.
>>>
>>> We are very aware of all the things you mentioned below.
>>> None of those apply.
>>>
>>> Not GC (we monitor every possible resource in the JVM and the system).
>>> No IO. No swapping. No guest-OS virtualization. No logging.
>>>
>>
>> Hm, ok. A few more ideas then:
>>
>> 1) What is the connectivity like between the servers? What is the
>> ping time between them?
>>
>> Is the system perhaps loading down the network during this test,
>> causing network latency to increase? Are all the NICs (server and
>> client) configured correctly? I've seen a number of cases where
>> clients and/or servers had incorrectly configured NICs (ethtool
>> reported 10 Mb/s half duplex for what should be gigabit Ethernet).
>>
>> 2) Regarding IO: if you run 'iostat -x 2' on the zk servers while
>> your issue is happening, what is the %util of the disk? What does
>> the iowait look like?
>>
>> 3) Create a JIRA and upload your three server configuration files.
>> Include the log4j.properties file you are using and any other
>> details you think might be useful. If you can, upload a log file
>> from when you see this issue; if you can't get one from that window,
>> upload any log file.
>>
>>> Oh, one thing I should mention is that it is not 1000 clients doing
>>> 1000 logins/logouts per second. All operations, such as closeSession
>>> and ping, take more than 8 seconds (at peak).
>>
>> Are you continuously logging in and then logging out, 1000 times per
>> second? That's not a good use case for ZK sessions in general.
>> Perhaps if you describe your use case in more detail it would help.
>>
>> Patrick
>>
>>> It's about CommitProcessor thread queueing (in the leader).
>>> QueuedRequests goes up to 800, and so do CommittedRequests and
>>> PendingRequestElapsedTime; PendingRequestElapsedTime goes up to
>>> 8.8 seconds during this flood.
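As an aside, a quick way to watch server-side latency while the flood
is happening is to poll the 'stat' four-letter command on the leader
and grab its Latency line. A rough sketch of such a poller (the
'zk-leader' hostname, connect timeout, and poll interval are
illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Polls a ZooKeeper server's 'stat' command and prints its Latency
    // line (min/avg/max request latency as seen by that server).
    public class StatProbe {
        public static void main(String[] args) throws Exception {
            while (true) {
                Socket sock = new Socket();
                sock.connect(new InetSocketAddress("zk-leader", 2181), 2000);
                sock.getOutputStream().write("stat".getBytes());
                sock.getOutputStream().flush();
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(sock.getInputStream()));
                String line;
                // The server prints its stats and closes the connection.
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Latency")) {
                        System.out.println(System.currentTimeMillis() + " " + line);
                    }
                }
                sock.close();
                Thread.sleep(5000);  // sample every 5 seconds
            }
        }
    }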
>>> To reproduce this scenario exactly, the easiest way is to:
>>>
>>> - suspend all client JVMs in group B with a debugger, or
>>> - cause all client JVMs in group B to OOME and write heap dumps.
>>>
>>> All clients in group A will then fail to receive a ping response
>>> within 5 seconds.
>>>
>>> We need to fix this as soon as possible. What we do as a workaround
>>> is to raise the sessionTimeout to 40 seconds. At least the clients
>>> in group A survive, but this increases our cluster failover time
>>> significantly.
>>>
>>> Thank you, Patrick.
>>>
>>> ps. We actually push ping requests to the FinalRequestProcessor as
>>> soon as a packet identifies itself as a ping. No dice.
>>>
>>>
>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>
>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>> looked through the troubleshooting guide?
>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>
>>>> In particular, 1000 clients connecting should be fine; I've
>>>> personally seen clusters of 7-10 thousand clients. Keep in mind
>>>> that each session establishment is essentially a write (so the
>>>> quorum is involved), and what we typically see there is that the
>>>> cluster configuration has issues. 14 seconds for a ping response is
>>>> huge and indicates one of the following may be an underlying cause:
>>>>
>>>> 1) Are you running in a virtualized environment?
>>>> 2) Are you co-locating other services on the same host(s) that make
>>>> up the ZK serving cluster?
>>>> 3) Have you followed the admin guide's "things to avoid"?
>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>> In particular, ensure that you are not swapping or going into GC
>>>> pause (both on the server and the client):
>>>> a) Turn on GC logging and verify that you are not going into GC
>>>> pause; see the troubleshooting guide. This is the most common cause
>>>> of high latency for clients.
>>>> b) Ensure that you are not swapping.
>>>> c) Ensure that other processes are not causing log writing
>>>> (transactional logging) to be slow.
>>>>
>>>> Patrick
>>>>
>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64...@me.com> wrote:
>>>>> Hello, folks.
>>>>>
>>>>> We have run into a very serious issue with ZooKeeper. Here's a
>>>>> brief scenario.
>>>>>
>>>>> We have some ZooKeeper clients with a session timeout of 15
>>>>> seconds (thus a 5-second ping interval); let's call these clients
>>>>> group A.
>>>>>
>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>> same time, all trying to connect to a three-node ZK ensemble,
>>>>> creating a ZK createSession stampede.
>>>>>
>>>>> Almost all clients in group A are then unable to exchange a ping
>>>>> within the session expiry time (15 seconds), so the clients in
>>>>> group A drop out of the cluster.
>>>>>
>>>>> We have looked into this issue a bit and found it is mostly due to
>>>>> the synchronous nature of session queue processing. Latency
>>>>> between a ping request and its response ranges from 10 ms up to
>>>>> 14 seconds during this login stampede.
>>>>>
>>>>> Since session timeout is a serious matter for our cluster, pings
>>>>> should be handled in a pseudo-realtime fashion.
>>>>>
>>>>> I don't know exactly how the ping timeout policy works in the
>>>>> clients and the server, but clients failing to receive a ping
>>>>> response because of ZooKeeper login sessions makes no sense to me.
>>>>>
>>>>> Shouldn't we have a separate ping/heartbeat queue and thread? Or
>>>>> even multiple ping queues/threads to keep the heartbeat realtime?
>>>>>
>>>>> This is a very serious issue with ZooKeeper for our
>>>>> mission-critical system. Could anyone look into this?
>>>>>
>>>>> I will try to file a bug.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Chang
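PS. For anyone who wants to reproduce the group A side, a minimal
sketch of such a client (the connect string and class name are
illustrative; the one parameter that matters is the 15-second session
timeout, which yields the 5-second ping interval):

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class GroupAClient implements Watcher {
        private final CountDownLatch expired = new CountDownLatch(1);

        // With a 15s session timeout the client pings roughly every 5s;
        // if ping responses are delayed past the timeout during the
        // group B stampede, the session expires and we land here.
        public void process(WatchedEvent event) {
            switch (event.getState()) {
            case Disconnected:
                System.out.println("disconnected at " + System.currentTimeMillis());
                break;
            case Expired:
                System.out.println("session expired at " + System.currentTimeMillis());
                expired.countDown();
                break;
            default:
                break;
            }
        }

        public static void main(String[] args) throws Exception {
            GroupAClient watcher = new GroupAClient();
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, watcher);
            watcher.expired.await();  // block until the stampede expires our session
            zk.close();
        }
    }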