What was the actual process size of the JVM as reported by top?

Why use the following in your config?

-XX:NewRatio=16 -XX:MaxGCPauseMillis=100

Do you really have a stringent latency target, or are you just being aggressive?
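For reference, if you don't have a hard pause-time target, a plainer CMS line would drop both of those flags and size the young generation explicitly. This is only a sketch -- the -Xmn value is an illustrative assumption, not a recommendation for your workload:

    # in hbase-env.sh (variable name assumes HBase 0.90 or later)
    export HBASE_REGIONSERVER_OPTS="-Xms2g -Xmx2g -Xmn256m \
      -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Note that NewRatio=16 on a 2 GB heap works out to the ~116 MB young generation visible in your log (118912K), which drives very frequent ParNew collections and rapid promotion into the CMS generation.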
If I'm reading your log correctly, you have about 2.5 GB of heap, right? If so, your full GC times are too long by over an order of magnitude. Here is a snippet from one of my servers for comparison:

2393278.112: [GC [PSYoungGen: 453884K->221141K(466048K)] 1716800K->1510206K(1864192K), 0.1394090 secs] [Times: user=0.55 sys=0.00, real=0.14 secs]
2393281.457: [GC [PSYoungGen: 454229K->181061K(466048K)] 1743294K->1529614K(1864192K), 0.1377460 secs] [Times: user=0.54 sys=0.00, real=0.13 secs]
2393281.595: [Full GC [PSYoungGen: 181061K->0K(466048K)] [ParOldGen: 1348552K->985640K(1398144K)] 1529614K->985640K(1864192K) [PSPermGen: 21228K->21225K(21504K)], 0.6138910 secs] [Times: user=2.16 sys=0.02, real=0.62 secs]
2393285.890: [GC [PSYoungGen: 233088K->33764K(466048K)] 1218728K->1019404K(1864192K), 0.0228820 secs] [Times: user=0.09 sys=0.00, real=0.02 secs]
...
2393323.370: [GC [PSYoungGen: 272556K->32976K(466048K)] 1581838K->1381385K(1864192K), 0.0444630 secs] [Times: user=0.18 sys=0.00, real=0.04 secs]
2393323.415: [Full GC [PSYoungGen: 32976K->0K(466048K)] [ParOldGen: 1348409K->975392K(1398144K)] 1381385K->975392K(1864192K) [PSPermGen: 21225K->21225K(21504K)], 0.5755400 secs] [Times: user=2.08 sys=0.00, real=0.58 secs]
2393326.708: [GC [PSYoungGen: 233088K->32236K(470528K)] 1208480K->1007628K(1868672K), 0.0222720 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
2393329.372: [GC [PSYoungGen: 269804K->30186K(466048K)] 1245196K->1037366K(1864192K), 0.0377520 secs] [Times: user=0.14 sys=0.00, real=0.04 secs]

Granted, your heap is a little larger, but not that much larger. ~600 ms might be too long a wait for you, so you might not want to use the parallel compacting collector. Still, even when CMS falls back to the serial collector on a concurrent mode failure, your pauses should be more in the 3-5 second range (just guessing -- see the user times above), not the 51 seconds you saw.

It's possible that you actually are in swap due to uncollected off-heap memory allocations. I doubt that even severe fragmentation of the heap would cause that kind of slowdown. Remember, the JVM memory footprint is not only the Java heap. In fact, I have recently seen situations with both JBoss and HBase where the heap is a small fraction of the JVM process size. This seems to occur most often in lightly loaded installations.

It might be interesting to have a look around your processes with pmap or some tool like that. In many situations, you'll be able to account for each Java heap generation allocation easily enough, and then identify a succession of other allocations that aren't part of the Java heap. Those latter allocations are likely to be your problem, if my guess is on the money.
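If you want to take that look, here's a minimal sketch (assuming Linux; jps is just one way to find the pid, any method works):

    # find the regionserver pid (jps ships with the JDK)
    RS_PID=$(jps | awk '/HRegionServer/ {print $1}')

    # total virtual and resident size as the OS sees it
    ps -o pid,vsz,rss,cmd -p "$RS_PID"

    # per-mapping breakdown, largest mappings first; once you've matched
    # the Java heap generations, big leftover anon mappings are the
    # candidate off-heap allocations
    pmap -x "$RS_PID" | sort -k2 -n -r | head -20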
There's a little info about using pmap against the JVM here:
http://www.kdgregory.com/index.php?page=java.outOfMemory

> -----Original Message-----
> From: Ferdy Galema [mailto:[email protected]]
> Sent: Saturday, March 03, 2012 05:05
> To: [email protected]
> Subject: gc pause killing regionserver
>
> Hi,
>
> I'm running regionservers with a 2GB heap and the following tuning options:
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:MaxGCPauseMillis=100
>
> A regionserver aborted (YouAreDeadException) and this was printed in the
> GC logs (everything is shown up until the abort):
>
> 211663.516: [GC 211663.516: [ParNew: 118715K->13184K(118912K), 0.0445390 secs] 1373940K->1289814K(2233472K), 0.0446420 secs] [Times: user=0.14 sys=0.01, real=0.05 secs]
> 211663.686: [GC 211663.686: [ParNew: 118912K->13184K(118912K), 0.0594280 secs] 1395542K->1310185K(2233472K), 0.0595420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> 211663.869: [GC 211663.869: [ParNew: 118790K->13184K(118912K), 0.0434820 secs] 1415792K->1331317K(2233472K), 0.0435930 secs] [Times: user=0.13 sys=0.01, real=0.04 secs]
> 211667.598: [GC 211667.598: [ParNew (promotion failed): 118912K->118912K(118912K), 0.0225390 secs]211667.621: [CMS: 1330845K->1127914K(2114560K), 51.3610670 secs] 1437045K->1127914K(2233472K), [CMS Perm : 20680K->20622K(34504K)], 51.3838170 secs] [Times: user=1.82 sys=0.31, real=51.38 secs]
> 211719.713: [GC 211719.714: [ParNew: 105723K->13184K(118912K), 0.0176130 secs] 1233638K->1149393K(2233472K), 0.0177230 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
> 211719.851: [GC 211719.852: [ParNew: 118912K->13184K(118912K), 0.0281860 secs] 1255121K->1170269K(2233472K), 0.0282970 secs] [Times: user=0.10 sys=0.01, real=0.03 secs]
> 211719.993: [GC 211719.993: [ParNew: 118795K->13184K(118912K), 0.0276320 secs] 1275880K->1191268K(2233472K), 0.0277350 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
> 211720.490: [GC 211720.490: [ParNew: 118912K->13184K(118912K), 0.0624650 secs] 1296996K->1210640K(2233472K), 0.0625560 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
> 211720.687: [GC 211720.687: [ParNew: 118702K->13184K(118912K), 0.1651750 secs] 1316159K->1231993K(2233472K), 0.1652660 secs] [Times: user=0.25 sys=0.01, real=0.17 secs]
> 211721.038: [GC 211721.038: [ParNew: 118912K->13184K(118912K), 0.0952750 secs] 1337721K->1252598K(2233472K), 0.0953660 secs] [Times: user=0.15 sys=0.00, real=0.09 secs]
>
> Heap
>  par new generation   total 118912K, used 86199K [0x00002aaaae1f0000, 0x00002aaab62f0000, 0x00002aaab62f0000)
>   eden space 105728K, 69% used [0x00002aaaae1f0000, 0x00002aaab293dfa8, 0x00002aaab4930000)
>   from space 13184K, 100% used [0x00002aaab4930000, 0x00002aaab5610000, 0x00002aaab5610000)
>   to   space 13184K, 0% used [0x00002aaab5610000, 0x00002aaab5610000, 0x00002aaab62f0000)
>  concurrent mark-sweep generation total 2114560K, used 1239414K [0x00002aaab62f0000, 0x00002aab373f0000, 0x00002aab373f0000)
>  concurrent-mark-sweep perm gen total 34504K, used 20728K [0x00002aab373f0000, 0x00002aab395a2000, 0x00002aab3c7f0000)
>
> Why did a GC take 51 seconds? The machine still had enough memory
> available, so it could not be swapping (swappiness is set to 0). Of the 15
> regionservers in total, I often see this specific regionserver fail. What do
> you recommend in this situation?
>
> Ferdy.
