FWIW, I've used strace in the past for things like this, once GC has been ruled out. (I believe it could easily be GC in this case, but that's easy to identify.) We had a nasty SSD write latency issue that we just couldn't figure out. Using strace, we were able to see that fsyncs were taking a really long time in some cases (most likely bad disk firmware). YMMV, but strace with grep/wc/awk/gnuplot is great for this stuff.
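A rough sketch of the kind of post-processing meant here, written in Python rather than awk. It assumes strace was run with `-T`, which appends the time spent in each syscall in angle brackets at the end of the line. The function names, the 5 ms threshold, and the sample lines are made up for illustration, and calls that strace splits into "unfinished"/"resumed" pairs are not handled.

```python
import re

# Matches completed fsync/fdatasync calls in `strace -T` output, e.g.
#   12345 10:29:51.905123 fsync(45) = 0 <0.026135>
# The trailing <seconds> field is only present when strace is run with -T.
SYNC_CALL = re.compile(
    r"\b(fsync|fdatasync)\((\d+)\)\s*=\s*(-?\d+).*<(\d+\.\d+)>\s*$"
)


def sync_latencies(lines):
    """Return a list of (syscall, fd, seconds) for each completed sync call."""
    results = []
    for line in lines:
        match = SYNC_CALL.search(line)
        if match:
            name, fd, _ret, secs = match.groups()
            results.append((name, int(fd), float(secs)))
    return results


def slow_calls(latencies, threshold_secs=0.005):
    """Keep only calls slower than the threshold (5 ms is an arbitrary pick)."""
    return [entry for entry in latencies if entry[2] > threshold_secs]


if __name__ == "__main__":
    # Hypothetical sample lines, invented for illustration (not real trace data).
    sample = [
        '12345 10:29:51.905123 fsync(45) = 0 <0.026135>',
        '12345 10:29:51.940001 write(45, "...", 128) = 128 <0.000031>',
        '12346 10:29:52.001500 fdatasync(46) = 0 <0.001200>',
    ]
    lat = sync_latencies(sample)
    print("sync calls:", len(lat))
    print("max seconds:", max(secs for _, _, secs in lat))
    print("slower than 5 ms:", slow_calls(lat))
```

Feeding it a real trace file is just `sync_latencies(open(path))`, and the resulting list of tuples is easy to dump into whatever plotting tool you prefer.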
Patrick

On Mon, Feb 24, 2014 at 2:13 PM, Camille Fournier wrote:

> You can try CMS. I don't think gc should be causing you pauses unless it's
> actually cleaning old gen; eden GC should be pauseless. You can tune the
> pool sizes to have enough space in old gen so you won't need pauses. The
> log you have printed above seems to be indicating the pauseless GC so I'm
> surprised it is causing noticeable performance degradation. But GC
> ergonomics aren't that hard to manage, especially in an application that
> should be used within existing process memory. Have you tried running this
> on an Oracle JDK instead of OpenJDK?
>
> C
>
>
> On Mon, Feb 24, 2014 at 3:34 PM, kishore g wrote:
>
>> try CMS garbage collector and see if it improves. I think you are great at
>> debugging, being new to JAVA and ZK, you were able to correlate GC activity
>> with latency spikes. Kudos for that.
>>
>> Try the following JVM Flags.
>>
>> -server -Xms<> -Xmx<> -XX:NewSize=<> -XX:MaxNewSize=<>
>> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
>>
>> If you use disk as backing store, I don't think you can get a consistent
>> read/write of 5ms. There are a lot of limitations in the design (most of
>> them are there to ensure consistency, for example every write ensures that
>> the transaction log is fsynced before acknowledging to the client).
>>
>> RAM disk might give you performance but you need to be prepared for the
>> catastrophic scenario where all zookeepers go down.
>>
>> thanks,
>> Kishore G
>>
>>
>> On Mon, Feb 24, 2014 at 9:36 AM, jmmec wrote:
>>
>> > Hey everyone,
>> >
>> > Did I mention that I'm a newbie to ZooKeeper and also to JAVA? :)
>> >
>> > I enabled some JAVA GC logs via the "java.env" file:
>> >
>> > export JVMFLAGS="-Xms1024m -Xmx1024m -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"
>> >
>> > and confirmed that the periodic latency is due to JAVA GC operations.
>> >
>> > For example, below is a 26ms delay which corresponds to a 26ms delay that
>> > my test app also saw (it uses the C API and connects to ZK remotely) and as
>> > also reported by ZK which is the only JAVA app running in the ZK cluster:
>> >
>> > 2014-02-24T10:29:51.905-0600: [GC [PSYoungGen: 275424K->12128K(305152K)] 325542K->73974K(1004544K), 0.0255720 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
>> > 2014-02-24T10:29:51.931-0600: Total time for which application threads were stopped: 0.0261350 seconds
>> >
>> > JAVA JVM tuning seems to be more of a black art than a science with respect
>> > to GC and other settings. I was wondering if anyone has any practical
>> > advice for JVM settings for the following configuration:
>> >
>> > a) ZK 3-node cluster running OpenJDK 1.7; ZK is the only app running JAVA.
>> > b) Application znode data and watches will fit into < 100MB of RAM (say
>> > 250k znodes with ~150 bytes per znode with 2 watchers per znode)
>> >
>> > Consistent and fast read / write latency - say 5ms or less - is critical
>> > for the small dataset above. I'm trying to understand if this is
>> > obtainable with ZK & JAVA. I realize that other factors come into play as
>> > well (hardware / network).
>> >
>> > Thanks in advance for any advice.
>> >
>> >
>> > On Fri, Feb 21, 2014 at 7:51 AM, jmmec wrote:
>> >
>> > > Thanks Camille, I definitely understand! :)
>> > >
>> > > The two questions at the top of mind regarding ZooKeeper are:
>> > > 1. How does it calculate latencies? I can dig into its code to see.
>> > > 2. Is there anything in particular that might cause it to have the spiky
>> > > latency I've experienced? I think I ruled out the snapshot behavior by
>> > > having a high snapCount.
>> > >
>> > > Some other things I am planning to explore:
>> > > 1. My test software is rightfully suspect, so I'll review it carefully
>> > > again and will simplify it further so that it is doing the absolute bare
>> > > minimum.
>> > > 2. I'm running OpenJDK 1.7.0_60-ea so might swap to an earlier and/or
>> > > different distribution.
>> > > 3. I'm running ZooKeeper 3.4.5 and might fall back to the 3.3.6 release.
>> > >
>> > > Hopefully one of the items above will reveal the root cause. Any other
>> > > suggestions are welcome.
>> > >
>> > >
>> > > On Thu, Feb 20, 2014 at 7:57 PM, Camille Fournier wrote:
>> > >
>> > >> I might suggest that you create a personal github and mock up a
>> > >> replication there :) I understand employers that own your code but
>> > >> unless someone knows the answer off the top of their head, odds of
>> > >> finding the cause are low without something that replicates it, and
>> > >> knowing how busy most of us are here I don't know that we'll have time
>> > >> to do that for you.
>> > >>
>> > >> C
>> > >>
>> > >>
>> > >> On Thu, Feb 20, 2014 at 9:41 PM, jmmec wrote:
>> > >>
>> > >> > Thanks again,
>> > >> >
>> > >> > Unfortunately I can't share the test code since it is technically the
>> > >> > property of my employer.
>> > >> >
>> > >> > It's very strange behavior. I think I've said that several times now.
>> > >> > ha...
>> > >> >
>> > >> > Appreciate any additional help or advice or suggestions from everyone
>> > >> > and anyone and their brother or sister.
>> > >> >
>> > >> >
>> > >> > On Thu, Feb 20, 2014 at 8:10 PM, Camille Fournier wrote:
>> > >> >
>> > >> > > Can you share the test code somewhere (github maybe?)?
>> > >> > >
>> > >> > > Thanks,
>> > >> > > C
>> > >> > >
>> > >> > >
>> > >> > > On Thu, Feb 20, 2014 at 9:08 PM, jmmec wrote:
>> > >> > >
>> > >> > > > Thanks for the quick reply.
>> > >> > > >
>> > >> > > > I did not try the "slow" test using a normal disk drive, however I
>> > >> > > > first discovered this problem when writing to a 7200RPM disk drive
>> > >> > > > at a much higher messaging rate (e.g. 1500 to 3000 creates/sec
>> > >> > > > rather than 84 creates/sec). This is what caused me to start
>> > >> > > > simplifying the configuration trying to find the root cause. As
>> > >> > > > part of that investigation, I created a RAM disk to avoid the hard
>> > >> > > > drive, but the hard drive wasn't the problem. I just haven't
>> > >> > > > switched back to the hard drive.
>> > >> > > >
>> > >> > > > I don't know what ZooKeeper is doing internally, or how & why it is
>> > >> > > > deriving 76ms MAX latency. The very regular periodic pattern
>> > >> > > > suggests something odd.
>> > >> > > >
>> > >> > > > Hmmmm.....
