Just wanted to add this link to Todd's explanation:
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html (Java
SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning).
It describes what Todd mentioned in more detail (to some extent, of course,
given how deep this topic is).
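
On Sean's question about how to tell whether the regionserver is using swap:
here is a minimal sketch, assuming Linux with a kernel that exposes the VmSwap
field in /proc/<pid>/status (older kernels may not have it). The function
names are made up for illustration.

```python
import re


def swap_kb_from_status(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status text, or None.

    None means the field was not present (for example, on a kernel that
    does not report per-process swap).
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def process_swap_kb(pid):
    """Read /proc/<pid>/status for a running process and return its swap in kB."""
    with open(f"/proc/{pid}/status") as status_file:
        return swap_kb_from_status(status_file.read())
```

Calling process_swap_kb() with the regionserver's pid and getting anything
above zero would suggest part of the heap has been paged out. The si/so
columns of vmstat are another way to watch for swap activity.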

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Wed, Nov 24, 2010 at 10:34 PM, Todd Lipcon <t...@cloudera.com> wrote:

> On Wed, Nov 24, 2010 at 7:01 AM, Sean Sechrist <ssechr...@gmail.com>
> wrote:
>
> > Hey guys,
> >
> > I just want to get an idea about how everyone avoids these long GC pauses
> > that cause regionservers to die.
> >
> > What kind of java heap and garbage collection settings do you use?
> >
> > What do you do to make sure that the HBase vm never uses swap? I have
> > heard turning off swap altogether can be dangerous, so right now we have
> > the setting vm.swappiness=0. How do you tell if it's using swap? On
> > Ganglia, we see the "CPU wio" metric at around 4.5% before one of our
> > crashes. Is that high?
> >
> > To try to avoid using too much memory, is reducing the memstore
> > upper/lower limit, or the block cache size a good idea? Should we just
> > tune down HBase's total heap to try to avoid swap?
> >
> > In terms of our specific problem:
> >
> > We seem to keep running into garbage collection pauses that cause the
> > regionservers to die. We have a mix of some random read jobs, as well as
> > a few full-scan jobs (~1.5 billion rows, 800-900GB of data, 1500
> > regions), and we are always inserting data. We would rather sacrifice a
> > little speed for stability, if that means anything. We have 7 nodes (RS +
> > DN + TT) with 12GB max heap given to HBase, and 24GB memory total.
> >
> > We were using the following garbage collection options:
> > -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m
> > -XX:CMSInitiatingOccupancyFraction=75
> >
> > After looking at http://wiki.apache.org/hadoop/PerformanceTuning, we are
> > trying to lower NewSize/MaxNewSize to 6m as well as reducing
> > CMSInitiatingOccupancyFraction to 50.
> >
>
> Rather than reducing the new size, you should consider increasing new size
> if you're OK with higher latency but fewer long GC pauses.
>
> GC is a complicated subject, but here are a few rules of thumb:
>
> - A larger young generation means that the young GC pauses, which are
> stop-the-world, will take longer. In my experience it's somewhere around 1
> second per GB of new size. So, if you're OK with periodic 1 second pauses,
> a large (1GB) new size should be fine.
> - A larger young generation also means that less data will get tenured to
> the old generation. This means that the old generation will have to collect
> less often and also that it will become less fragmented.
> - In HBase, the long (45+ second) pauses generally happen when promotion
> fails due to heap fragmentation in the old generation. The JVM then falls
> back to a stop-the-world compacting collection, which takes a long time.
>
> So, in general, a large young gen will reduce the frequency of super-long
> pauses, but will increase the frequency of shorter pauses.
>
> It sounds like you may be OK with longer young gen pauses, so maybe
> consider new size at 512M with your 12G total heap?
>
> I also wouldn't tune CMSInitiatingOccupancyFraction below 60% - that will
> cause CMS to always be running, which isn't that efficient.
>
> -Todd
>
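
To make Todd's rule of thumb above concrete, here is a rough sketch that
estimates the young GC pause for a given new size and prints the matching
flags. The 1 second per GB figure is Todd's estimate from experience, not a
measured constant, and the helper names are hypothetical. The flags are the
same ones Sean listed, with only the values changed.

```python
SECS_PER_GB = 1.0  # Todd's rough figure; an assumption, not a measurement


def estimated_young_pause_secs(new_size_mb, secs_per_gb=SECS_PER_GB):
    """Estimate the stop-the-world young GC pause for a given new size."""
    return new_size_mb / 1024.0 * secs_per_gb


def cms_flags(new_size_mb, occupancy_pct):
    """Build the CMS flag list with a fixed-size young generation."""
    return [
        "-XX:+UseConcMarkSweepGC",
        f"-XX:NewSize={new_size_mb}m",
        f"-XX:MaxNewSize={new_size_mb}m",
        f"-XX:CMSInitiatingOccupancyFraction={occupancy_pct}",
    ]


if __name__ == "__main__":
    # 512M new size as Todd suggests; 75 is Sean's original occupancy value.
    print(estimated_young_pause_secs(512))
    print(" ".join(cms_flags(512, 75)))
```

With a 512M new size the estimate comes out to about half a second per young
collection, which is the trade-off Todd describes: more frequent short pauses
in exchange for less promotion into the old generation.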
>
> >
> > We see messages like this in our GC logs:
> >
> > 2010-11-23T14:56:01.383-0500: 61297.449: [GC 61297.449: [ParNew (promotion
> > failed): 57425K->57880K(59008K), 0.1880950 secs]61297.637:
> > [CMS2010-11-23T14:56:06.336-0500: 61302.402: [CMS-concurrent-mark:
> > 8.844/17.169 secs] [Times: user=75.16 sys=1.34, real=17.17 secs]
> >  (concurrent mode failure): 10126729K->5760080K(13246464K), 91.2530340
> > secs]
> > 10181961K->5760080K(13305472K), [CMS Perm : 20252K->20241K(33868K)],
> > 91.4413320 secs] [Times: user=24.47 sys=1.07, real=91.44 secs]
> >
> > There are a lot of questions there, but I definitely appreciate any
> > advice or input anybody else has. Thanks so much!
> >
> > -Sean
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
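
For spotting these events in a longer GC log, here is a minimal sketch that
counts the two failure markers from Sean's excerpt and collects the "real"
wall-clock times. It assumes each log entry is on one unwrapped line in the
same format as above, and it drops the CMS-concurrent-* timings first, since
those are concurrent phases rather than stop-the-world pauses. The names are
illustrative only.

```python
import re

FAILURE_MARKERS = ("promotion failed", "concurrent mode failure")

# Timing of a concurrent CMS phase, e.g.
# "[CMS-concurrent-mark: 8.844/17.169 secs] [Times: ... real=17.17 secs]"
CONCURRENT_PHASE = re.compile(
    r"\[CMS-concurrent-[\w-]+: [\d.]+/[\d.]+ secs\]\s*\[Times:[^\]]*\]"
)
REAL_SECS = re.compile(r"real=([\d.]+) secs")


def summarize_gc_log(log_text):
    """Return (failure marker counts, list of pause wall-clock seconds)."""
    counts = {marker: log_text.count(marker) for marker in FAILURE_MARKERS}
    # Remove concurrent-phase timings so only pause timings remain.
    pauses_only = CONCURRENT_PHASE.sub("", log_text)
    pauses = [float(secs) for secs in REAL_SECS.findall(pauses_only)]
    return counts, pauses


# The entry from Sean's message, rejoined onto one line.
SAMPLE = (
    "2010-11-23T14:56:01.383-0500: 61297.449: [GC 61297.449: [ParNew "
    "(promotion failed): 57425K->57880K(59008K), 0.1880950 secs]61297.637: "
    "[CMS2010-11-23T14:56:06.336-0500: 61302.402: [CMS-concurrent-mark: "
    "8.844/17.169 secs] [Times: user=75.16 sys=1.34, real=17.17 secs] "
    "(concurrent mode failure): 10126729K->5760080K(13246464K), 91.2530340 "
    "secs] 10181961K->5760080K(13305472K), [CMS Perm : 20252K->20241K(33868K)], "
    "91.4413320 secs] [Times: user=24.47 sys=1.07, real=91.44 secs]"
)

if __name__ == "__main__":
    print(summarize_gc_log(SAMPLE))
```

On Sean's entry this reports one promotion failure, one concurrent mode
failure, and a single 91.44 second pause, which matches the fallback to a
stop-the-world compacting collection that Todd describes.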
