I only use one filesystem for all logs.
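For reference, the one-filesystem setup above is exactly what Mahadev's question below probes: ZooKeeper can split snapshots and the transaction log across separate devices via the dataDir and dataLogDir settings in zoo.cfg, so txn-log fsyncs don't compete with snapshot writes. A hypothetical fragment (the paths and values are illustrative, not from this thread):

```
# zoo.cfg sketch: snapshots and txn log on separate devices
# (paths are made up for illustration)
tickTime=2000
dataDir=/disk1/zookeeper/data        # fuzzy snapshots
dataLogDir=/disk2/zookeeper/txnlog   # transaction log, fsync-heavy
clientPort=2181
```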
On Apr 15, 2011, at 1:00 AM, Mahadev Konar wrote:

> Chang/Pat and others,
> I didn't see this in the discussions above, but are you guys using a
> single disk or two disks for ZK? One for the snapshot and one for txn
> logging?
>
> thanks
> mahadev
>
> 2011/4/14 Chang Song <tru64...@me.com>:
>>
>> On Apr 14, 2011, at 1:53 PM, Patrick Hunt wrote:
>>
>>> Two additional thoughts come to mind:
>>>
>>> 1) Try running the ensemble with a single ZK server. Does this help at
>>> all? (It might provide a short-term workaround; it might also provide
>>> some insight into what's causing the issue.)
>>
>> We are going to try this to see if we can identify the culprit.
>>
>> Thanks.
>>
>>> 2) Can you hold off some of the clients from the stampede? Perhaps add
>>> a random holdoff to each of the clients before connecting, and
>>> additionally a similar random holdoff before closing the session. This
>>> seems like a straightforward change on your client side (easy to
>>> implement/try), but it's hard to tell given we don't have much insight
>>> into what your use case is.
>>>
>>> Anyone else in the community have any ideas?
>>>
>>> Patrick
>>>
>>> 2011/4/13 Patrick Hunt <ph...@apache.org>:
>>>> 2011/4/13 Chang Song <tru64...@me.com>:
>>>>>
>>>>> Patrick,
>>>>> thank you for the reply.
>>>>>
>>>>> We are well aware of all the things you mentioned below.
>>>>> It is none of those.
>>>>>
>>>>> Not GC (we monitor every possible resource in the JVM and the system).
>>>>> No IO. No swapping.
>>>>> No VM guest OS. No logging.
>>>>
>>>> Hm, ok. A few more ideas then:
>>>>
>>>> 1) What is the connectivity like between the servers?
>>>> What is the ping time between them?
>>>>
>>>> Is the system perhaps loading down the network during this test,
>>>> causing network latency to increase? Are all the NIC cards (server and
>>>> client) configured correctly?
>>>> I've seen a number of cases where
>>>> clients and/or servers had incorrectly configured NICs (ethtool
>>>> reported 10 Mb/sec half duplex for what should be gigabit ethernet).
>>>>
>>>> 2) Regarding IO: if you run 'iostat -x 2' on the ZK servers while the
>>>> issue is happening, what's the %util of the disk? What does the
>>>> iowait look like?
>>>>
>>>> 3) Create a JIRA and upload your three server configuration files.
>>>> Include the log4j.properties file you are using and any other details
>>>> you think might be useful. If you can upload a log file from when you
>>>> see this issue, that would be useful; upload any log file if you
>>>> can't get one from the time when you see the issue.
>>>>
>>>>> Oh, one thing I should mention is that it is not 1000 clients doing
>>>>> 1000 logins/logouts per second. All operations, like closeSession and
>>>>> ping, take more than 8 seconds (at peak).
>>>>
>>>> Are you continuously logging in and then logging out, 1000 times per
>>>> second? That's not a good use case for ZK sessions in general. Perhaps
>>>> if you describe your use case in more detail it would help.
>>>>
>>>> Patrick
>>>>
>>>>> It's about CommitProcessor thread queueing (in the leader).
>>>>> QueuedRequests goes up to 800, and so do CommittedRequests and
>>>>> PendingRequestElapsedTime. PendingRequestElapsedTime
>>>>> goes up to 8.8 seconds during this flood.
>>>>>
>>>>> The easiest way to reproduce this scenario exactly is to
>>>>>
>>>>> - suspend all client JVMs with a debugger, or
>>>>> - cause all client JVMs to OOME to create heap dumps
>>>>>
>>>>> in group B. All clients in group A will then fail to receive a
>>>>> ping response within 5 seconds.
>>>>>
>>>>> We need to fix this as soon as possible.
>>>>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>>>>> At least the clients in group A survive, but this increases
>>>>> our cluster failover time significantly.
>>>>>
>>>>> Thank you, Patrick.
>>>>>
>>>>> ps.
>>>>> We actually push ping requests to the FinalRequestProcessor as soon
>>>>> as a packet identifies itself as a ping. No dice.
>>>>>
>>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>>
>>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>>> looked through the troubleshooting guide?
>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>>
>>>>>> In particular, 1000 clients connecting should be fine; I've
>>>>>> personally seen clusters of 7-10 thousand clients. Keep in mind that
>>>>>> each session establishment is essentially a write (so the quorum is
>>>>>> involved), and what we typically see there is that the cluster
>>>>>> configuration has issues. 14 seconds for a ping response is huge and
>>>>>> indicates that one of the following may be an underlying cause:
>>>>>>
>>>>>> 1) Are you running in a virtualized environment?
>>>>>> 2) Are you co-locating other services on the same host(s) that make
>>>>>> up the ZK serving cluster?
>>>>>> 3) Have you followed the admin guide's "things to avoid"?
>>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>>> In particular, ensure that you are not swapping or going into GC
>>>>>> pause (both on the server and the client):
>>>>>> a) Try turning on GC logging and ensure that you are not going into
>>>>>> GC pause; see the troubleshooting guide. This is the most common
>>>>>> cause of high latency for clients.
>>>>>> b) Ensure that you are not swapping.
>>>>>> c) Ensure that other processes are not causing log writing
>>>>>> (transactional logging) to be slow.
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64...@me.com> wrote:
>>>>>>> Hello, folks.
>>>>>>>
>>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>>> Here's a brief scenario.
>>>>>>>
>>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>>> (thus a 5 sec ping interval); let's call these clients group A.
>>>>>>>
>>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>>> same time trying to connect to a three-node ZK ensemble, creating
>>>>>>> a ZK createSession stampede.
>>>>>>>
>>>>>>> Now almost no client in group A is able to exchange a ping within
>>>>>>> the session expiry time (15 sec),
>>>>>>> so the clients in group A drop out of the cluster.
>>>>>>>
>>>>>>> We have looked into this issue a bit and found it is mostly due to
>>>>>>> the synchronous nature of session queue processing.
>>>>>>> Latency between ping request and response ranges from 10 ms up to
>>>>>>> 14 seconds during this login stampede.
>>>>>>>
>>>>>>> Since session timeout is a serious matter for our cluster, ping
>>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>>>
>>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>>> clients and the server, but clients failing to receive ping
>>>>>>> responses because of ZooKeeper login sessions seems like nonsense
>>>>>>> to me.
>>>>>>>
>>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>>> Or even multiple ping queues/threads to keep heartbeats realtime?
>>>>>>>
>>>>>>> This is a very serious issue with ZooKeeper for our
>>>>>>> mission-critical system. Could anyone look into this?
>>>>>>>
>>>>>>> I will try to file a bug.
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Chang
>
> --
> thanks
> mahadev
> @mahadevkonar
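Patrick's random-holdoff suggestion earlier in the thread can be sketched as below. This is a minimal illustration under assumptions, not ZooKeeper client code: connect stands in for whatever call the application uses to open a session, and the 5-second window is an arbitrary choice.

```python
import random
import time

def jittered_holdoff(window_s=5.0):
    """Uniform random delay in [0, window_s) for this client."""
    return random.random() * window_s

def start_client(connect, window_s=5.0):
    # Each client sleeps its own random slice of the window before
    # opening a session, so a fleet of clients no longer hits the
    # quorum (session creation is a quorum write) in the same instant.
    time.sleep(jittered_holdoff(window_s))
    return connect()
```

The same holdoff would be applied before closing sessions. Spreading 1000 session creations over a 5 s window means the quorum sees roughly 200 session writes per second instead of a single burst of 1000.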