fwiw I wasn't saying it wasn't gc btw. (I didn't look very deep to be honest) I was just saying that if it is gc it should be pretty easy to identify. turn on verbose logging or use jstat, etc...
e.g. http://www.cubrid.org/blog/dev-platform/how-to-monitor-java-garbage-collection/ Patrick On Tue, Feb 25, 2014 at 8:56 AM, kishore g <[email protected]> wrote: > My bad, i dint really see the GC log entry. I agree with Camille and Phunt > that the GC is unlikely cause here. Another thing to consider is; do the > latency spikes correlate to snapshot file generation. Are you writing the > snapshots to RAMDisk or spinning disk. > > Without additional info/code, we can only guess. > > thanks, > Kishore G > > > On Mon, Feb 24, 2014 at 3:18 PM, Patrick Hunt <[email protected]> wrote: > >> fwiw I've used strace in the past for things like this (if you were to >> rule out GC, I believe it could easily be GC in this case but that's >> easy to identify). Had a nasty ssd write latency issue that we just >> couldn't figure out. Using strace we were able to see that fsyncs were >> taking a really long time in some cases. (bad disk firmware most >> likely). YMMV but strace with grep/wc/awk/gnuplot is great for this >> stuff. >> >> Patrick >> >> On Mon, Feb 24, 2014 at 2:13 PM, Camille Fournier <[email protected]> >> wrote: >> > You can try CMS. I don't think gc should be causing you pauses unless >> it's >> > actually cleaning old gen, eden GC should be pauseless. You can tune the >> > pool sizes to have enough space in old gen so you won't need pauses. The >> > log you have printed above seems to be indicating the pauseless GC so I'm >> > surprised it is causing noticeable performance degredation. But GC >> > ergonomics aren't that hard to manage, especially in an application that >> > should be used within existing process memory. Have you tried running >> this >> > on an oracle JDK instead of OpenJDK? >> > >> > C >> > >> > >> > On Mon, Feb 24, 2014 at 3:34 PM, kishore g <[email protected]> wrote: >> > >> >> try CMS garbage collector and see if it improves. I think you are great >> at >> >> debugging, being new to JAVA and ZK, you were able to correlate GC >> activity >> >> with latency spikes. Kudos for that. >> >> >> >> Try the following JVM Flags. >> >> >> >> -server -Xms<> -Xmx<> -XX:NewSize=<> -XX:MaxNewSize=<> >> >> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 >> >> >> >> If you use disk as backing store, i dont think you can get a consistent >> >> read/write of 5ms. There are lot of limitations in the design (most of >> them >> >> are there to ensure consistency, for example every writes ensure that >> >> transaction log is fsynced before acknowledging to the client). >> >> >> >> RAM disk might give your performance but you need to be prepared for the >> >> catastrophic scenario where all zookeepers go down. >> >> >> >> thanks, >> >> Kishore G >> >> >> >> >> >> >> >> On Mon, Feb 24, 2014 at 9:36 AM, jmmec <[email protected]> wrote: >> >> >> >> > Hey everyone, >> >> > >> >> > Did I mention that I'm a newbie to ZooKeeper and also to JAVA? :) >> >> > >> >> > I enabled some JAVA GC logs via the "java.env" file: >> >> > >> >> > export JVMFLAGS="-Xms1024m -Xmx1024m -XX:+PrintGCDetails >> >> > -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime" >> >> > >> >> > and confirmed that the periodic latency is due to JAVA GC operations. >> >> > >> >> > For example, below is a 26ms delay which corresponds to a 26ms delay >> that >> >> > my test app also saw (it uses the C API and connects to ZK remotely) >> and >> >> as >> >> > also reported by ZK which is the only JAVA app running in the ZK >> cluster: >> >> > >> >> > 2014-02-24T10:29:51.905-0600: [GC [PSYoungGen: >> 275424K->12128K(305152K)] >> >> > 325542K->73974K(1004544K), 0.0255720 secs] [Times: user=0.09 sys=0.00, >> >> > real=0.03 secs] >> >> > 2014-02-24T10:29:51.931-0600: Total time for which application threads >> >> were >> >> > stopped: 0.0261350 seconds >> >> > >> >> > JAVA JVM tuning seems to be more of a black art than a science with >> >> respect >> >> > to GC and other settings. I was wondering if anyone has any practical >> >> > advice for JVM settings for the following configuration: >> >> > >> >> > a) ZK 3-node cluster running OpenJDK 1.7; ZK is the only app running >> >> JAVA. >> >> > b) Application znode data and watches will fit into < 100MB of RAM >> (say >> >> > 250k znodes with ~150 bytes per znode with 2 watchers per znode) >> >> > >> >> > Consistent and fast read / write latency - say 5ms or less - is >> critical >> >> > for the small dataset above. I'm trying to understand if this is >> >> > obtainable with ZK & JAVA. I realize that other factors come into >> play >> >> as >> >> > well (hardware / network). >> >> > >> >> > Thanks in advance for any advice. >> >> > >> >> > >> >> > On Fri, Feb 21, 2014 at 7:51 AM, jmmec <[email protected]> wrote: >> >> > >> >> > > Thanks Camille, I definitely understand! :) >> >> > > >> >> > > The two questions at the top of mind regarding ZooKeeper are: >> >> > > 1. How does it calculate latencies? I can dig into its code to see. >> >> > > 2. Is there anything in particular that might cause it to have the >> >> spiky >> >> > > latency I've experienced? I think I ruled out the snapshot >> behavior by >> >> > > having a high snapCount. >> >> > > >> >> > > Some other things I am planning to explore: >> >> > > 1. My test software is rightfully suspect, so I'll review it >> carefully >> >> > > again and will simplify it further so that it is doing the absolute >> >> bare >> >> > > minimum. >> >> > > 2. I'm running OpenJDK 1.7.0_60-ea so might swap to an earlier >> and/or >> >> > > different distribution. >> >> > > 3. I'm running ZooKeeper 3.4.5 and might fall back to the 3.3.6 >> >> release. >> >> > > >> >> > > Hopefully one of the items above will reveal the root cause. Any >> other >> >> > > suggestions are welcome. >> >> > > >> >> > > >> >> > > >> >> > > On Thu, Feb 20, 2014 at 7:57 PM, Camille Fournier < >> [email protected] >> >> > >wrote: >> >> > > >> >> > >> I might suggest that you create a personal github and mock up a >> >> > >> replication >> >> > >> there :) I understand employers that own your code but unless >> someone >> >> > >> knows >> >> > >> the answer off the top of their head, odds of finding the cause are >> >> low >> >> > >> without something that replicates it, and knowing how busy most of >> us >> >> > are >> >> > >> here I don't know that we'll have time to do that for you. >> >> > >> >> >> > >> C >> >> > >> >> >> > >> >> >> > >> On Thu, Feb 20, 2014 at 9:41 PM, jmmec <[email protected]> >> wrote: >> >> > >> >> >> > >> > Thanks again, >> >> > >> > >> >> > >> > Unfortunately I can't share the test code since it is technically >> >> the >> >> > >> > property of my employer. >> >> > >> > >> >> > >> > It's very strange behavior. I think I've said that several times >> >> now. >> >> > >> > ha... >> >> > >> > >> >> > >> > Appreciate any additional help or advice or suggestions from >> >> everyone >> >> > >> and >> >> > >> > anyone and their brother or sister. >> >> > >> > >> >> > >> > >> >> > >> > >> >> > >> > On Thu, Feb 20, 2014 at 8:10 PM, Camille Fournier < >> >> [email protected] >> >> > >> > >wrote: >> >> > >> > >> >> > >> > > Can you share the test code somewhere (github maybe?)? >> >> > >> > > >> >> > >> > > Thanks, >> >> > >> > > C >> >> > >> > > >> >> > >> > > >> >> > >> > > On Thu, Feb 20, 2014 at 9:08 PM, jmmec <[email protected]> >> >> wrote: >> >> > >> > > >> >> > >> > > > Thanks for the quick reply. >> >> > >> > > > >> >> > >> > > > I did not try the "slow" test using a normal disk drive, >> >> however I >> >> > >> > first >> >> > >> > > > discovered this problem when writing to a 7200RPM disk drive >> at >> >> a >> >> > >> much >> >> > >> > > > higher messaging rate (e.g. 1500 to 3000 creates/sec rather >> than >> >> > 84 >> >> > >> > > > creates/sec). This is what caused me to start simplifying >> the >> >> > >> > > > configuration trying to find the root cause. As part of that >> >> > >> > > > investigation, I created a RAM disk to avoid the hard drive, >> but >> >> > the >> >> > >> > hard >> >> > >> > > > drive wasn't the problem. I just haven't switched back to >> the >> >> > hard >> >> > >> > > drive. >> >> > >> > > > >> >> > >> > > > I don't know what ZooKeeper is doing internally, or how & >> why it >> >> > is >> >> > >> > > > deriving 76ms MAX latency. The very regular periodic pattern >> >> > >> suggests >> >> > >> > > > something odd. >> >> > >> > > > >> >> > >> > > > Hmmmm..... >> >> > >> > > > >> >> > >> > > >> >> > >> > >> >> > >> >> >> > > >> >> > > >> >> > >> >> >>
