Re: ZK 3.4.5: Very Strange Write Latency Problem?

Patrick Hunt Mon, 24 Feb 2014 15:20:06 -0800

FWIW, I've used strace in the past for things like this (once you've
ruled out GC; I believe it could easily be GC in this case, but that's
easy to identify). We had a nasty SSD write latency issue that we just
couldn't figure out. Using strace we were able to see that fsyncs were
taking a really long time in some cases (most likely bad disk
firmware). YMMV, but strace with grep/wc/awk/gnuplot is great for this
stuff.
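
As a rough illustration of the strace approach, the sketch below pulls
per-call fsync/fdatasync durations out of a trace captured with
something like "strace -f -T -tt -e trace=fsync,fdatasync -p <pid> -o
trace.txt". It is written in Python rather than grep/awk only to keep it
self-contained. The sample lines, the 5ms threshold, and the function
names are assumptions made for the example, not output from the system
discussed in this thread.

```python
import re

# Matches completed fsync/fdatasync calls in `strace -T` output, including the
# "<... fsync resumed>" lines that appear when following threads with -f.
# The trailing "<0.001234>" is the time spent inside the syscall, in seconds.
FSYNC_RE = re.compile(
    r"(?:\b(?:fsync|fdatasync)\(|<\.\.\. (?:fsync|fdatasync) resumed>)"
    r".*<(\d+\.\d+)>\s*$"
)


def fsync_durations(lines):
    """Return the seconds spent in each completed fsync/fdatasync call."""
    durations = []
    for line in lines:
        match = FSYNC_RE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations


def summarize(durations, slow_threshold=0.005):
    """Count calls, find the worst one, and count those above the threshold."""
    return {
        "count": len(durations),
        "max": max(durations) if durations else 0.0,
        "slow": sum(1 for d in durations if d > slow_threshold),
    }


# Made-up sample lines in the shape strace prints with -f -T -tt.
SAMPLE = """\
[pid  4321] 10:29:51.905123 write(45, "..."..., 128) = 128 <0.000020>
[pid  4321] 10:29:51.905200 fdatasync(45) = 0 <0.000812>
[pid  4322] 10:29:51.931000 fdatasync(45 <unfinished ...>
[pid  4322] 10:29:51.957400 <... fdatasync resumed>) = 0 <0.026310>
[pid  4321] 10:29:52.100000 fsync(46) = 0 <0.001100>
""".splitlines()

if __name__ == "__main__":
    print(summarize(fsync_durations(SAMPLE)))
```

To run it against a real capture, pass the lines of the trace file
(for example open("trace.txt")) to fsync_durations instead of SAMPLE.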


Patrick

On Mon, Feb 24, 2014 at 2:13 PM, Camille Fournier <[email protected]> wrote:
> You can try CMS. I don't think GC should be causing you pauses unless it's
> actually cleaning old gen; eden GC should be pauseless. You can tune the
> pool sizes to have enough space in old gen so you won't need pauses. The
> log you have printed above seems to be indicating the pauseless GC so I'm
> surprised it is causing noticeable performance degradation. But GC
> ergonomics aren't that hard to manage, especially in an application that
> should be used within existing process memory. Have you tried running this
> on an Oracle JDK instead of OpenJDK?
>
> C
>
>
> On Mon, Feb 24, 2014 at 3:34 PM, kishore g <[email protected]> wrote:
>
>> Try the CMS garbage collector and see if it improves things. I think you are
>> great at debugging: being new to JAVA and ZK, you were still able to
>> correlate GC activity with latency spikes. Kudos for that.
>>
>> Try the following JVM flags.
>>
>> -server -Xms<> -Xmx<> -XX:NewSize=<> -XX:MaxNewSize=<>
>> -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
>>
>> If you use disk as the backing store, I don't think you can get a consistent
>> read/write latency of 5ms. There are a lot of limitations in the design (most
>> of them are there to ensure consistency; for example, every write ensures
>> that the transaction log is fsynced before acknowledging to the client).
>>
>> A RAM disk might give you the performance, but you need to be prepared for
>> the catastrophic scenario where all ZooKeepers go down.
>>
>> thanks,
>> Kishore G
>>
>>
>>
>> On Mon, Feb 24, 2014 at 9:36 AM, jmmec <[email protected]> wrote:
>>
>> > Hey everyone,
>> >
>> > Did I mention that I'm a newbie to ZooKeeper and also to JAVA?   :)
>> >
>> > I enabled some JAVA GC logs via the "java.env" file:
>> >
>> > export JVMFLAGS="-Xms1024m -Xmx1024m -XX:+PrintGCDetails
>> > -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"
>> >
>> > and confirmed that the periodic latency is due to JAVA GC operations.
>> >
>> > For example, below is a 26ms delay which corresponds to a 26ms delay that
>> > my test app also saw (it uses the C API and connects to ZK remotely) and as
>> > also reported by ZK which is the only JAVA app running in the ZK cluster:
>> >
>> > 2014-02-24T10:29:51.905-0600: [GC [PSYoungGen: 275424K->12128K(305152K)] 325542K->73974K(1004544K), 0.0255720 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
>> > 2014-02-24T10:29:51.931-0600: Total time for which application threads were stopped: 0.0261350 seconds
>> >
>> > JAVA JVM tuning seems to be more of a black art than a science with respect
>> > to GC and other settings.  I was wondering if anyone has any practical
>> > advice for JVM settings for the following configuration:
>> >
>> > a) ZK 3-node cluster running OpenJDK 1.7; ZK is the only app running JAVA.
>> > b) Application znode data and watches will fit into < 100MB of RAM (say
>> > 250k znodes with ~150 bytes per znode with 2 watchers per znode)
>> >
>> > Consistent and fast read / write latency - say 5ms or less - is critical
>> > for the small dataset above.  I'm trying to understand if this is
>> > obtainable with ZK & JAVA.  I realize that other factors come into play as
>> > well (hardware / network).
>> >
>> > Thanks in advance for any advice.
>> >
>> >
>> > On Fri, Feb 21, 2014 at 7:51 AM, jmmec <[email protected]> wrote:
>> >
>> > > Thanks Camille, I definitely understand!  :)
>> > >
>> > > The two questions at the top of mind regarding ZooKeeper are:
>> > > 1. How does it calculate latencies?  I can dig into its code to see.
>> > > 2. Is there anything in particular that might cause it to have the spiky
>> > > latency I've experienced?  I think I ruled out the snapshot behavior by
>> > > having a high snapCount.
>> > >
>> > > Some other things I am planning to explore:
>> > > 1. My test software is rightfully suspect, so I'll review it carefully
>> > > again and will simplify it further so that it is doing the absolute bare
>> > > minimum.
>> > > 2. I'm running OpenJDK 1.7.0_60-ea so might swap to an earlier and/or
>> > > different distribution.
>> > > 3. I'm running ZooKeeper 3.4.5 and might fall back to the 3.3.6 release.
>> > >
>> > > Hopefully one of the items above will reveal the root cause.  Any other
>> > > suggestions are welcome.
>> > >
>> > >
>> > >
>> > > On Thu, Feb 20, 2014 at 7:57 PM, Camille Fournier <[email protected]> wrote:
>> > >
>> > >> I might suggest that you create a personal github and mock up a
>> > >> replication there :) I understand employers that own your code but unless
>> > >> someone knows the answer off the top of their head, odds of finding the
>> > >> cause are low without something that replicates it, and knowing how busy
>> > >> most of us are here I don't know that we'll have time to do that for you.
>> > >>
>> > >> C
>> > >>
>> > >>
>> > >> On Thu, Feb 20, 2014 at 9:41 PM, jmmec <[email protected]> wrote:
>> > >>
>> > >> > Thanks again,
>> > >> >
>> > >> > Unfortunately I can't share the test code since it is technically the
>> > >> > property of my employer.
>> > >> >
>> > >> > It's very strange behavior.  I think I've said that several times now.
>> > >> > ha...
>> > >> >
>> > >> > Appreciate any additional help or advice or suggestions from everyone and
>> > >> > anyone and their brother or sister.
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Thu, Feb 20, 2014 at 8:10 PM, Camille Fournier <[email protected]> wrote:
>> > >> >
>> > >> > > Can you share the test code somewhere (github maybe?)?
>> > >> > >
>> > >> > > Thanks,
>> > >> > > C
>> > >> > >
>> > >> > >
>> > >> > > On Thu, Feb 20, 2014 at 9:08 PM, jmmec <[email protected]> wrote:
>> > >> > >
>> > >> > > > Thanks for the quick reply.
>> > >> > > >
>> > >> > > > I did not try the "slow" test using a normal disk drive, however I first
>> > >> > > > discovered this problem when writing to a 7200RPM disk drive at a much
>> > >> > > > higher messaging rate (e.g. 1500 to 3000 creates/sec rather than 84
>> > >> > > > creates/sec).  This is what caused me to start simplifying the
>> > >> > > > configuration trying to find the root cause.  As part of that
>> > >> > > > investigation, I created a RAM disk to avoid the hard drive, but the hard
>> > >> > > > drive wasn't the problem.  I just haven't switched back to the hard drive.
>> > >> > > >
>> > >> > > > I don't know what ZooKeeper is doing internally, or how & why it is
>> > >> > > > deriving 76ms MAX latency.  The very regular periodic pattern suggests
>> > >> > > > something odd.
>> > >> > > >
>> > >> > > > Hmmmm.....
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
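
The GC correlation described in the quoted thread (matching "Total time
for which application threads were stopped" lines against a latency
target) can be checked mechanically. The following is a minimal Python
sketch, assuming a log produced with -XX:+PrintGCDateStamps and
-XX:+PrintGCApplicationStoppedTime as in the quoted java.env. The 5ms
budget comes from the latency goal stated in the thread; the function
names are made up for the example.

```python
import re

# One line per safepoint pause, as printed by
# -XX:+PrintGCApplicationStoppedTime (with a date stamp prefix).
STOPPED_RE = re.compile(
    r"(\S+): Total time for which application threads were stopped: "
    r"(\d+\.\d+) seconds"
)


def stopped_pauses(lines):
    """Return (timestamp, pause in milliseconds) for each pause line."""
    pauses = []
    for line in lines:
        match = STOPPED_RE.search(line)
        if match:
            pauses.append((match.group(1), round(float(match.group(2)) * 1000, 3)))
    return pauses


def over_budget(lines, budget_ms=5.0):
    """Return only the pauses longer than the latency budget."""
    return [(ts, ms) for ts, ms in stopped_pauses(lines) if ms > budget_ms]


# The two log lines quoted in the thread, unwrapped.
SAMPLE = [
    "2014-02-24T10:29:51.905-0600: [GC [PSYoungGen: 275424K->12128K(305152K)] "
    "325542K->73974K(1004544K), 0.0255720 secs] "
    "[Times: user=0.09 sys=0.00, real=0.03 secs]",
    "2014-02-24T10:29:51.931-0600: Total time for which application threads "
    "were stopped: 0.0261350 seconds",
]

if __name__ == "__main__":
    for timestamp, pause_ms in over_budget(SAMPLE):
        print(timestamp, pause_ms, "ms")
```

Feeding the whole GC log through over_budget gives a list of timestamps
that can be lined up against the latency spikes seen by the client.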
