Correction: 2300300300 bytes is the current heap size for the regionservers (I misread another process earlier).
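For anyone tuning along at home, this is roughly how the options under discussion end up in hbase-env.sh. A sketch only, assuming an hbase-env.sh that honors HBASE_REGIONSERVER_OPTS (older setups put everything in HBASE_OPTS); the heap size, log path and occupancy fraction are example values from this thread, not recommendations:

```shell
# Example regionserver JVM options (illustrative values; tune per cluster).
# Assumes hbase-env.sh supports HBASE_REGIONSERVER_OPTS; otherwise use HBASE_OPTS.
export HBASE_REGIONSERVER_OPTS="-Xms2g -Xmx2g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/regionserver-gc.log"
```

Per the advice downthread, the next things I would try are lowering CMSInitiatingOccupancyFraction and growing the new generation (e.g. via -XX:NewSize/-XX:MaxNewSize) rather than keeping -XX:NewRatio=16 and -XX:MaxGCPauseMillis=100.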
On Tue, Mar 6, 2012 at 11:37 AM, Ferdy Galema <[email protected]> wrote:

> Thanks for the replies. That is a lot of useful information. I admit that
> I'm still running a bit behind when it comes to truly understanding GC
> mechanics and how HBase deals with them. I also find the following
> resource useful for understanding GC:
>
> https://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf
>
> > Why use the following in your config?
> I use these GC tuning options because I found them somewhere on the
> mailing list, advertised as generally advisable GC options. I think it
> would be nice if the HBase ref guide recommended default GC settings; I
> can imagine they differ for different heap sizes.
>
> > It's possible that you actually are in swap due to uncollected off-heap
> > memory allocations. I doubt that even severe fragmentation on the heap
> > would cause that kind of slowdown
> Munin does show some minor swapping (and memory overcommit), but
> considering the amount of free space left (OS disk cache) and the fact
> that swappiness is set to zero, I was under the impression that it was
> harmless. On second thought, I will dig deeper into this.
>
> > If I'm reading your log correctly, you have about 2.5 GB of heap, right?
> That's right, 2100100100 bytes to be exact.
>
> > Does this server have the same load as the other ones?
> Yes. They all run about the same number of regions and generally have the
> same load. The hardware is (should be) identical.
>
> Ferdy.
>
> On Mon, Mar 5, 2012 at 10:38 PM, Mikael Sitruk <[email protected]> wrote:
>
>> Try setting the initiating occupancy to a lower value, and try to
>> allocate more space to the new generation. It seems that you don't have
>> enough room in the survivor space, and therefore you get a promotion
>> failure.
>>
>> You also mention that the same server is frequently having this problem;
>> is it because of your data spread? Does this server have the same load
>> as the other ones?
>>
>> On Mon, Mar 5, 2012 at 11:00 PM, Sandy Pratt <[email protected]> wrote:
>>
>> > What was the actual process size of the JVM as reported by top?
>> >
>> > Why use the following in your config?
>> >
>> > -XX:NewRatio=16
>> > -XX:MaxGCPauseMillis=100
>> >
>> > Do you really have a stringent latency target, or are you just being
>> > aggressive?
>> >
>> > If I'm reading your log correctly, you have about 2.5 GB of heap,
>> > right? If so, your full GC times are too long by over an order of
>> > magnitude. Here is a snippet from one of my servers for comparison:
>> >
>> > 2393278.112: [GC [PSYoungGen: 453884K->221141K(466048K)] 1716800K->1510206K(1864192K), 0.1394090 secs] [Times: user=0.55 sys=0.00, real=0.14 secs]
>> > 2393281.457: [GC [PSYoungGen: 454229K->181061K(466048K)] 1743294K->1529614K(1864192K), 0.1377460 secs] [Times: user=0.54 sys=0.00, real=0.13 secs]
>> > 2393281.595: [Full GC [PSYoungGen: 181061K->0K(466048K)] [ParOldGen: 1348552K->985640K(1398144K)] 1529614K->985640K(1864192K) [PSPermGen: 21228K->21225K(21504K)], 0.6138910 secs] [Times: user=2.16 sys=0.02, real=0.62 secs]
>> > 2393285.890: [GC [PSYoungGen: 233088K->33764K(466048K)] 1218728K->1019404K(1864192K), 0.0228820 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
>> > ...
>> > 2393323.370: [GC [PSYoungGen: 272556K->32976K(466048K)] 1581838K->1381385K(1864192K), 0.0444630 secs] [Times: user=0.18 sys=0.00, real=0.04 secs]
>> > 2393323.415: [Full GC [PSYoungGen: 32976K->0K(466048K)] [ParOldGen: 1348409K->975392K(1398144K)] 1381385K->975392K(1864192K) [PSPermGen: 21225K->21225K(21504K)], 0.5755400 secs] [Times: user=2.08 sys=0.00, real=0.58 secs]
>> > 2393326.708: [GC [PSYoungGen: 233088K->32236K(470528K)] 1208480K->1007628K(1868672K), 0.0222720 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
>> > 2393329.372: [GC [PSYoungGen: 269804K->30186K(466048K)] 1245196K->1037366K(1864192K), 0.0377520 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]
>> >
>> > Granted, your heap is a little larger, but not that much larger.
>> >
>> > ~600 ms might be too long to wait for you, so you might not want to
>> > use the parallel compacting collector. Still, it seems that even when
>> > you fall back to the serial collector during concurrent mode failure,
>> > your pauses should be more in the 3-5 second range (just guessing --
>> > see user times above).
>> >
>> > It's possible that you actually are in swap due to uncollected
>> > off-heap memory allocations. I doubt that even severe fragmentation
>> > on the heap would cause that kind of slowdown. Remember, the JVM
>> > memory footprint is not only Java heap. In fact, I have recently seen
>> > situations with both JBoss and HBase where heap size is a small
>> > fraction of JVM process size. This seems to occur most often in
>> > lightly loaded installations. It might be interesting to have a look
>> > around your processes with pmap or some tool like that. In many
>> > situations, you'll probably be able to account for each Java heap
>> > generation allocation easily enough, then be able to identify a
>> > succession of other allocations that aren't part of the Java heap.
>> > Those latter parts are likely to be your problem, if my guess is on
>> > the money.
>> >
>> > There's a little info about using pmap against the JVM here:
>> >
>> > http://www.kdgregory.com/index.php?page=java.outOfMemory
>> >
>> > > -----Original Message-----
>> > > From: Ferdy Galema [mailto:[email protected]]
>> > > Sent: Saturday, March 03, 2012 05:05
>> > > To: [email protected]
>> > > Subject: gc pause killing regionserver
>> > >
>> > > Hi,
>> > >
>> > > I'm running regionservers with a 2GB heap and the following tuning
>> > > options:
>> > >
>> > > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
>> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
>> > > -XX:MaxGCPauseMillis=100
>> > >
>> > > A regionserver aborted (YouAreDeadException) and this was printed in
>> > > the gc logs (everything up until the abort is shown):
>> > >
>> > > 211663.516: [GC 211663.516: [ParNew: 118715K->13184K(118912K), 0.0445390 secs] 1373940K->1289814K(2233472K), 0.0446420 secs] [Times: user=0.14 sys=0.01, real=0.05 secs]
>> > > 211663.686: [GC 211663.686: [ParNew: 118912K->13184K(118912K), 0.0594280 secs] 1395542K->1310185K(2233472K), 0.0595420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
>> > > 211663.869: [GC 211663.869: [ParNew: 118790K->13184K(118912K), 0.0434820 secs] 1415792K->1331317K(2233472K), 0.0435930 secs] [Times: user=0.13 sys=0.01, real=0.04 secs]
>> > > 211667.598: [GC 211667.598: [ParNew (promotion failed): 118912K->118912K(118912K), 0.0225390 secs]211667.621: [CMS: 1330845K->1127914K(2114560K), 51.3610670 secs] 1437045K->1127914K(2233472K), [CMS Perm : 20680K->20622K(34504K)], 51.3838170 secs] [Times: user=1.82 sys=0.31, real=51.38 secs]
>> > > 211719.713: [GC 211719.714: [ParNew: 105723K->13184K(118912K), 0.0176130 secs] 1233638K->1149393K(2233472K), 0.0177230 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
>> > > 211719.851: [GC 211719.852: [ParNew: 118912K->13184K(118912K), 0.0281860 secs] 1255121K->1170269K(2233472K), 0.0282970 secs] [Times: user=0.10 sys=0.01, real=0.03 secs]
>> > > 211719.993: [GC 211719.993: [ParNew: 118795K->13184K(118912K), 0.0276320 secs] 1275880K->1191268K(2233472K), 0.0277350 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
>> > > 211720.490: [GC 211720.490: [ParNew: 118912K->13184K(118912K), 0.0624650 secs] 1296996K->1210640K(2233472K), 0.0625560 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
>> > > 211720.687: [GC 211720.687: [ParNew: 118702K->13184K(118912K), 0.1651750 secs] 1316159K->1231993K(2233472K), 0.1652660 secs] [Times: user=0.25 sys=0.01, real=0.17 secs]
>> > > 211721.038: [GC 211721.038: [ParNew: 118912K->13184K(118912K), 0.0952750 secs] 1337721K->1252598K(2233472K), 0.0953660 secs] [Times: user=0.15 sys=0.00, real=0.09 secs]
>> > > Heap
>> > >  par new generation   total 118912K, used 86199K [0x00002aaaae1f0000, 0x00002aaab62f0000, 0x00002aaab62f0000)
>> > >   eden space 105728K,  69% used [0x00002aaaae1f0000, 0x00002aaab293dfa8, 0x00002aaab4930000)
>> > >   from space 13184K, 100% used [0x00002aaab4930000, 0x00002aaab5610000, 0x00002aaab5610000)
>> > >   to   space 13184K,   0% used [0x00002aaab5610000, 0x00002aaab5610000, 0x00002aaab62f0000)
>> > >  concurrent mark-sweep generation total 2114560K, used 1239414K [0x00002aaab62f0000, 0x00002aab373f0000, 0x00002aab373f0000)
>> > >  concurrent-mark-sweep perm gen total 34504K, used 20728K [0x00002aab373f0000, 0x00002aab395a2000, 0x00002aab3c7f0000)
>> > >
>> > > Why did a GC take 51 seconds? The machine still had enough memory
>> > > available, so it could not have been swapping (swappiness is set to
>> > > 0). Of the 15 regionservers in total, I often see this specific
>> > > regionserver fail. What do you recommend in this situation?
>> > >
>> > > Ferdy.
>>
>> --
>> Mikael.S
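In case it helps anyone else triaging this: a quick way to surface pauses like the 51-second collection above is to pull the wall-clock times out of the GC log. Just a sketch that assumes the "[Times: user=... sys=..., real=N secs]" suffix both of our logs show, and that the log file is named gc.log:

```shell
# Print the five longest wall-clock GC pauses recorded in gc.log.
# Assumes entries end with "[Times: user=... sys=..., real=N secs]".
grep -o 'real=[0-9.]*' gc.log | cut -d= -f2 | sort -rn | head -5
```

Anything above a second or so in that output is worth correlating with the regionserver's ZooKeeper session timeout.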
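To follow up on Sandy's pmap suggestion: before digging through individual mappings, it is worth comparing the JVM's resident set (and swap) against the configured heap. A minimal sketch, assuming Linux /proc and that the regionserver process matches "HRegionServer"; it falls back to the current shell so it can be dry-run anywhere:

```shell
# Find the regionserver pid; fall back to the current shell for a dry run.
RS_PID=$(pgrep -f HRegionServer | head -1)
RS_PID=${RS_PID:-$$}

# Resident set size and (on newer kernels) swap usage for that process.
# If VmRSS is far above -Xmx plus perm gen and thread stacks, off-heap
# allocations are in play, as Sandy suggests.
grep -E 'VmRSS|VmSwap' "/proc/$RS_PID/status"
```

From there, pmap -x "$RS_PID" gives the per-mapping breakdown: the Java heap generations are easy to account for, and whatever large anonymous mappings remain are the off-heap suspects.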
