My bad, i dint really see the GC log entry. I agree with Camille and Phunt that the GC is unlikely cause here. Another thing to consider is; do the latency spikes correlate to snapshot file generation. Are you writing the snapshots to RAMDisk or spinning disk.
Without additional info/code, we can only guess. thanks, Kishore G On Mon, Feb 24, 2014 at 3:18 PM, Patrick Hunt <[email protected]> wrote: > fwiw I've used strace in the past for things like this (if you were to > rule out GC, I believe it could easily be GC in this case but that's > easy to identify). Had a nasty ssd write latency issue that we just > couldn't figure out. Using strace we were able to see that fsyncs were > taking a really long time in some cases. (bad disk firmware most > likely). YMMV but strace with grep/wc/awk/gnuplot is great for this > stuff. > > Patrick > > On Mon, Feb 24, 2014 at 2:13 PM, Camille Fournier <[email protected]> > wrote: > > You can try CMS. I don't think gc should be causing you pauses unless > it's > > actually cleaning old gen, eden GC should be pauseless. You can tune the > > pool sizes to have enough space in old gen so you won't need pauses. The > > log you have printed above seems to be indicating the pauseless GC so I'm > > surprised it is causing noticeable performance degredation. But GC > > ergonomics aren't that hard to manage, especially in an application that > > should be used within existing process memory. Have you tried running > this > > on an oracle JDK instead of OpenJDK? > > > > C > > > > > > On Mon, Feb 24, 2014 at 3:34 PM, kishore g <[email protected]> wrote: > > > >> try CMS garbage collector and see if it improves. I think you are great > at > >> debugging, being new to JAVA and ZK, you were able to correlate GC > activity > >> with latency spikes. Kudos for that. > >> > >> Try the following JVM Flags. > >> > >> -server -Xms<> -Xmx<> -XX:NewSize=<> -XX:MaxNewSize=<> > >> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 > >> > >> If you use disk as backing store, i dont think you can get a consistent > >> read/write of 5ms. There are lot of limitations in the design (most of > them > >> are there to ensure consistency, for example every writes ensure that > >> transaction log is fsynced before acknowledging to the client). > >> > >> RAM disk might give your performance but you need to be prepared for the > >> catastrophic scenario where all zookeepers go down. > >> > >> thanks, > >> Kishore G > >> > >> > >> > >> On Mon, Feb 24, 2014 at 9:36 AM, jmmec <[email protected]> wrote: > >> > >> > Hey everyone, > >> > > >> > Did I mention that I'm a newbie to ZooKeeper and also to JAVA? :) > >> > > >> > I enabled some JAVA GC logs via the "java.env" file: > >> > > >> > export JVMFLAGS="-Xms1024m -Xmx1024m -XX:+PrintGCDetails > >> > -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime" > >> > > >> > and confirmed that the periodic latency is due to JAVA GC operations. > >> > > >> > For example, below is a 26ms delay which corresponds to a 26ms delay > that > >> > my test app also saw (it uses the C API and connects to ZK remotely) > and > >> as > >> > also reported by ZK which is the only JAVA app running in the ZK > cluster: > >> > > >> > 2014-02-24T10:29:51.905-0600: [GC [PSYoungGen: > 275424K->12128K(305152K)] > >> > 325542K->73974K(1004544K), 0.0255720 secs] [Times: user=0.09 sys=0.00, > >> > real=0.03 secs] > >> > 2014-02-24T10:29:51.931-0600: Total time for which application threads > >> were > >> > stopped: 0.0261350 seconds > >> > > >> > JAVA JVM tuning seems to be more of a black art than a science with > >> respect > >> > to GC and other settings. I was wondering if anyone has any practical > >> > advice for JVM settings for the following configuration: > >> > > >> > a) ZK 3-node cluster running OpenJDK 1.7; ZK is the only app running > >> JAVA. > >> > b) Application znode data and watches will fit into < 100MB of RAM > (say > >> > 250k znodes with ~150 bytes per znode with 2 watchers per znode) > >> > > >> > Consistent and fast read / write latency - say 5ms or less - is > critical > >> > for the small dataset above. I'm trying to understand if this is > >> > obtainable with ZK & JAVA. I realize that other factors come into > play > >> as > >> > well (hardware / network). > >> > > >> > Thanks in advance for any advice. > >> > > >> > > >> > On Fri, Feb 21, 2014 at 7:51 AM, jmmec <[email protected]> wrote: > >> > > >> > > Thanks Camille, I definitely understand! :) > >> > > > >> > > The two questions at the top of mind regarding ZooKeeper are: > >> > > 1. How does it calculate latencies? I can dig into its code to see. > >> > > 2. Is there anything in particular that might cause it to have the > >> spiky > >> > > latency I've experienced? I think I ruled out the snapshot > behavior by > >> > > having a high snapCount. > >> > > > >> > > Some other things I am planning to explore: > >> > > 1. My test software is rightfully suspect, so I'll review it > carefully > >> > > again and will simplify it further so that it is doing the absolute > >> bare > >> > > minimum. > >> > > 2. I'm running OpenJDK 1.7.0_60-ea so might swap to an earlier > and/or > >> > > different distribution. > >> > > 3. I'm running ZooKeeper 3.4.5 and might fall back to the 3.3.6 > >> release. > >> > > > >> > > Hopefully one of the items above will reveal the root cause. Any > other > >> > > suggestions are welcome. > >> > > > >> > > > >> > > > >> > > On Thu, Feb 20, 2014 at 7:57 PM, Camille Fournier < > [email protected] > >> > >wrote: > >> > > > >> > >> I might suggest that you create a personal github and mock up a > >> > >> replication > >> > >> there :) I understand employers that own your code but unless > someone > >> > >> knows > >> > >> the answer off the top of their head, odds of finding the cause are > >> low > >> > >> without something that replicates it, and knowing how busy most of > us > >> > are > >> > >> here I don't know that we'll have time to do that for you. > >> > >> > >> > >> C > >> > >> > >> > >> > >> > >> On Thu, Feb 20, 2014 at 9:41 PM, jmmec <[email protected]> > wrote: > >> > >> > >> > >> > Thanks again, > >> > >> > > >> > >> > Unfortunately I can't share the test code since it is technically > >> the > >> > >> > property of my employer. > >> > >> > > >> > >> > It's very strange behavior. I think I've said that several times > >> now. > >> > >> > ha... > >> > >> > > >> > >> > Appreciate any additional help or advice or suggestions from > >> everyone > >> > >> and > >> > >> > anyone and their brother or sister. > >> > >> > > >> > >> > > >> > >> > > >> > >> > On Thu, Feb 20, 2014 at 8:10 PM, Camille Fournier < > >> [email protected] > >> > >> > >wrote: > >> > >> > > >> > >> > > Can you share the test code somewhere (github maybe?)? > >> > >> > > > >> > >> > > Thanks, > >> > >> > > C > >> > >> > > > >> > >> > > > >> > >> > > On Thu, Feb 20, 2014 at 9:08 PM, jmmec <[email protected]> > >> wrote: > >> > >> > > > >> > >> > > > Thanks for the quick reply. > >> > >> > > > > >> > >> > > > I did not try the "slow" test using a normal disk drive, > >> however I > >> > >> > first > >> > >> > > > discovered this problem when writing to a 7200RPM disk drive > at > >> a > >> > >> much > >> > >> > > > higher messaging rate (e.g. 1500 to 3000 creates/sec rather > than > >> > 84 > >> > >> > > > creates/sec). This is what caused me to start simplifying > the > >> > >> > > > configuration trying to find the root cause. As part of that > >> > >> > > > investigation, I created a RAM disk to avoid the hard drive, > but > >> > the > >> > >> > hard > >> > >> > > > drive wasn't the problem. I just haven't switched back to > the > >> > hard > >> > >> > > drive. > >> > >> > > > > >> > >> > > > I don't know what ZooKeeper is doing internally, or how & > why it > >> > is > >> > >> > > > deriving 76ms MAX latency. The very regular periodic pattern > >> > >> suggests > >> > >> > > > something odd. > >> > >> > > > > >> > >> > > > Hmmmm..... > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > > >> > > > >> > > >> >
