I've been working on improving GC settings for HBase RegionServer instances, and I seem to have run into a bit of a dead end.
Basically I've been trying to tune GC settings and memory settings appropriately, but I keep on having my server reach something like GC Death. My current memory settings are HEAPSIZE=24576 -ea -Xmx24576 -Xms24576 -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:NewSize=192m -XX:MaxNewSize=192m -XX:CMSInitiatingOccupancyFraction=60 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log -XX:ParallelGCThreads=6 We've also set hfile.block.cache.size to 0.5 ( believing that incresing the space for cache should improve performance; I'm willing to accept that this could be causing problems, I just haven't seen this reported) Basically as the service is "warming up", we see reasonable GC Log timings. ... 2011-12-06T10:19:14.834+0000: 782.473: [GC [1 CMS-initial-mark: 2225311K(3705244K)] 2237042K(3882204K), 0.0028630 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 2011-12-06T10:19:14.837+0000: 782.476: [CMS-concurrent-mark-start] 2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-mark: 0.686/0.686 secs] [Times: user=1.98 sys=0.05, real=0.69 secs] 2011-12-06T10:19:15.523+0000: 783.162: [CMS-concurrent-preclean-start] 2011-12-06T10:19:15.533+0000: 783.172: [CMS-concurrent-preclean: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 2011-12-06T10:19:15.534+0000: 783.172: [CMS-concurrent-abortable-preclean-start] 2011-12-06T10:19:15.742+0000: 783.381: [GC 783.381: [ParNew: 168978K->9067K(176960K), 0.0076540 secs] 2394290K->2235970K(3882204K), 0.0077920 secs] [Times: user=0.05 sys=0.00, real=0.00 secs] 2011-12-06T10:19:16.351+0000: 783.990: [CMS-concurrent-abortable-preclean: 0.123/0.817 secs] [Times: user=0.87 sys=0.06, real=0.81 secs] 2011-12-06T10:19:16.352+0000: 783.990: [GC[YG occupancy: 97548 K (176960 K)]783.990: [Rescan (parallel) , 0.0153180 secs]784.006: [weak refs processing, 0.0000170 secs] [1 CMS-remark: 2226903K(3705244K)] 2324451K(3882204K), 0.0154470 secs] [Times: user=0.09 sys=0.00, real=0.02 secs] ... This seems to continue for about for about 16 hours But as we hit some critical time, we end up stuck running multiple young GC/second. Note that they at appear to be successful (if I'm reading the logs correctly) 011-12-06T23:49:42.132+0000: 49409.770: [GC 49409.771: [ParNew: 166477K->7992K(176960K), 0.0390800 secs] 11642960K->11486691K(19297180K), 0.0392470 secs] [Times: user=0.22 sys=0.00, real=0.04 secs] 2011-12-06T23:49:42.522+0000: 49410.161: [CMS-concurrent-abortable-preclean: 0.247/0.589 secs] [Times: user=1.04 sys=0.02, real=0.58 secs] 2011-12-06T23:49:42.523+0000: 49410.162: [GC[YG occupancy: 149306 K (176960 K)]49410.162: [Rescan (parallel) , 0.0314250 secs]49410.193: [weak refs processing, 0.0000890 secs] [1 CMS-remark: 11478698K(1 9120220K)] 11628005K(19297180K), 0.0316410 secs] [Times: user=0.17 sys=0.01, real=0.03 secs] 2011-12-06T23:49:42.555+0000: 49410.194: [CMS-concurrent-sweep-start] 2011-12-06T23:49:42.597+0000: 49410.236: [GC 49410.236: [ParNew: 165304K->7868K(176960K), 0.0405890 secs] 11643677K->11487303K(19297180K), 0.0407690 secs] [Times: user=0.23 sys=0.00, real=0.04 secs] 2011-12-06T23:49:43.048+0000: 49410.687: [GC 49410.687: [ParNew: 165180K->6485K(176960K), 0.0389860 secs] 11027946K->10869726K(19297180K), 0.0392000 secs] [Times: user=0.23 sys=0.00, real=0.04 secs] 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-sweep: 0.542/0.625 secs] [Times: user=1.73 sys=0.02, real=0.62 secs] 2011-12-06T23:49:43.181+0000: 49410.819: [CMS-concurrent-reset-start] 2011-12-06T23:49:43.232+0000: 49410.870: [CMS-concurrent-reset: 0.051/0.051 secs] [Times: user=0.10 sys=0.00, real=0.05 secs] 2011-12-06T23:49:43.481+0000: 49411.120: [GC 49411.120: [ParNew: 163797K->7150K(176960K), 0.0409170 secs] 10380339K->10224698K(19297180K), 0.0410870 secs] [Times: user=0.24 sys=0.00, real=0.04 secs] 2011-12-06T23:49:43.920+0000: 49411.559: [GC 49411.559: [ParNew: 164462K->8178K(176960K), 0.0394640 secs] 10382010K->10226321K(19297180K), 0.0396290 secs] [Times: user=0.22 sys=0.00, real=0.04 secs] 2011-12-06T23:49:44.355+0000: 49411.994: [GC 49411.994: [ParNew: 165490K->8303K(176960K), 0.0367330 secs] 10383633K->10227244K(19297180K), 0.0368970 secs] [Times: user=0.22 sys=0.00, real=0.03 secs] 2011-12-06T23:49:44.785+0000: 49412.424: [GC 49412.424: [ParNew: 165615K->10439K(176960K), 0.0398080 secs] 10384556K->10229598K(19297180K), 0.0399870 secs] [Times: user=0.23 sys=0.00, real=0.04 secs] 2011-12-06T23:49:45.225+0000: 49412.864: [GC 49412.864: [ParNew: 167751K->13171K(176960K), 0.0393970 secs] 10386910K->10233071K(19297180K), 0.0395730 secs] [Times: user=0.23 sys=0.00, real=0.04 secs] ... Note that we are doing about 2 YGC/second (about 80ms). I believe that this situation represents us using up some undesirable amount of memory resources, but I can't find any particular reason why we're going from using a reasonable amount of GC to using GC constantly. One theory is that this is the result of something in the young generation being promoted? But I'm not really sure what would be causing this pathological behavior. Any ideas are alternative places to look would be greatly appreciated. --Derek