The hlog rolled, and QPS was at 1000, not much at all. I tend to believe that the concurrent GC is taking too many threads away from the app layer and causing CPU to run up. Setting incremental mode should help; I will do so and report what I find.
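I believe the change amounts to something like this in conf/hbase-env.sh (a sketch only: -XX:+CMSIncrementalMode and -XX:+CMSIncrementalPacing are the stock Sun JVM flags for incremental CMS, and they assume CMS itself is enabled via -XX:+UseConcMarkSweepGC):

  export HBASE_OPTS="$HBASE_OPTS \
    -XX:+UseConcMarkSweepGC \
    -XX:+CMSIncrementalMode \
    -XX:+CMSIncrementalPacing"

Incremental mode breaks the concurrent mark/sweep work into small slices run between young-gen collections, so the CMS background threads stop monopolizing cores, at the cost of a longer overall concurrent cycle.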
-Jack

On Mon, May 16, 2011 at 1:48 PM, Stack <[email protected]> wrote:
> On Mon, May 16, 2011 at 1:36 PM, Jack Levin <[email protected]> wrote:
>> How do you tell? These are the log entries from when we had 100% CPU:
>>
>> 2011-05-14T15:48:58.240-0700: 5128.407: [GC 5128.407: [ParNew: 17723K->780K(19136K), 0.0199350 secs] 4309804K->4292973K(5777060K), 0.0200660 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
>
> This doesn't look bad. It's going from 17.8MB to 0.8MB in 0.02s of
> real time. Looks like your new gen is 19M at the moment, which seems
> fine. Have you set -Xmx == -Xms? The log would seem to say you are
> using only 4.3G.
>
> Looking at the rest of your log, we seem to be doing fine too; we're
> just cleaning out loads of garbage, and we're doing it in young gen --
> which is usually what you want -- and we're doing it pretty quickly.
>
> What's going on on your servers at the time? You said you took off the
> request load, but maybe there is a bunch of work queued, or we're
> running lots of compactions at the time?
>
> St.Ack
>
>> 2011-05-14T15:48:58.349-0700: 5128.515: [GC 5128.515: [ParNew: 17804K->886K(19136K), 0.0188930 secs] 4309997K->4293119K(5777060K), 0.0189990 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:58.456-0700: 5128.623: [GC 5128.623: [ParNew: 17910K->900K(19136K), 0.0144800 secs] 4310143K->4293237K(5777060K), 0.0145890 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:58.563-0700: 5128.729: [GC 5128.729: [ParNew: 17924K->945K(19136K), 0.0144940 secs] 4310261K->4293469K(5777060K), 0.0146210 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:58.668-0700: 5128.835: [GC 5128.835: [ParNew: 17969K->731K(19136K), 0.0149060 secs] 4310493K->4293412K(5777060K), 0.0150210 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>> 2011-05-14T15:48:58.773-0700: 5128.940: [GC 5128.940: [ParNew: 17755K->906K(19136K), 0.0248650 secs] 4310436K->4293724K(5777060K), 0.0249820 secs] [Times: user=0.07 sys=0.00, real=0.03 secs]
>> 2011-05-14T15:48:58.891-0700: 5129.057: [GC 5129.057: [ParNew: 17930K->1031K(19136K), 0.0194370 secs] 4310748K->4293961K(5777060K), 0.0196490 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.004-0700: 5129.171: [GC 5129.171: [ParNew: 18055K->960K(19136K), 0.0145450 secs] 4310985K->4293993K(5777060K), 0.0146530 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.115-0700: 5129.281: [GC 5129.282: [ParNew: 17984K->1098K(19136K), 0.0143740 secs] 4311017K->4294296K(5777060K), 0.0144780 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.224-0700: 5129.391: [GC 5129.391: [ParNew: 18122K->867K(19136K), 0.0150360 secs] 4311320K->4294248K(5777060K), 0.0151500 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.331-0700: 5129.498: [GC 5129.498: [ParNew: 17891K->1011K(19136K), 0.0220280 secs] 4311272K->4294544K(5777060K), 0.0221320 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.441-0700: 5129.608: [GC 5129.608: [ParNew: 18035K->1064K(19136K), 0.0164780 secs] 4311568K->4294803K(5777060K), 0.0165820 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.548-0700: 5129.714: [GC 5129.715: [ParNew: 18088K->1193K(19136K), 0.0141640 secs] 4311827K->4295008K(5777060K), 0.0142730 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>> 2011-05-14T15:48:59.653-0700: 5129.820: [GC 5129.820: [ParNew: 18217K->1078K(19136K), 0.0150800 secs] 4312032K->4295035K(5777060K), 0.0151780 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.754-0700: 5129.921: [GC 5129.921: [ParNew: 18102K->906K(19136K), 0.0149160 secs] 4312059K->4295145K(5777060K), 0.0150130 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>> 2011-05-14T15:48:59.858-0700: 5130.025: [GC 5130.025: [ParNew: 17930K->749K(19136K), 0.0177820 secs] 4312169K->4295108K(5777060K), 0.0178890 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
>> 2011-05-14T15:48:59.962-0700: 5130.129: [GC 5130.129: [ParNew: 17773K->970K(19136K), 0.0145400 secs] 4312132K->4295460K(5777060K), 0.0146530 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
>>
>> -Jack
>>
>> On Mon, May 16, 2011 at 12:44 PM, Stack <[email protected]> wrote:
>>> What is the size of your new gen? Is it growing? Does it level off?
>>> St.Ack
>>>
>>> On Mon, May 16, 2011 at 12:39 PM, Jack Levin <[email protected]> wrote:
>>>> The heap would be 8G used out of 12G total. The gc log would be full
>>>> of ParNew, with no Full GC at all; ParNew just starts at a high rate
>>>> (10-15 lines per second). Even if you remove the queries, the JVM
>>>> will not recover, and user CPU will remain at 100%. The only thing
>>>> left to do is to kill the process. We are not writing nearly fast
>>>> enough to cause such an issue. This looks like a systemic problem:
>>>> the load is low in terms of QPS, yet the problem effectively kills
>>>> HBase. There should be a way to tune things so that ParNew does not
>>>> jump to inappropriate levels.
>>>>
>>>> -Jack
>>>>
>>>> On Mon, May 16, 2011 at 12:02 PM, Stack <[email protected]> wrote:
>>>>> On Sun, May 15, 2011 at 5:37 PM, Jack Levin <[email protected]> wrote:
>>>>>> I've added occupancy:
>>>>>>
>>>>>> export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
>>>>>>
>>>>> Does the big CPU usage correlate to a Full GC? Do you notice that in
>>>>> your logs, Jack?
>>>>>
>>>>> I took a look at your thread dumps. Nothing untoward (no blocking on
>>>>> hdfs access).
>>>>>
>>>>> St.Ack
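A quick way to check the Full GC question above against a log in this format (a sketch that assumes the stock -XX:+PrintGCDetails output quoted in this thread):

  # any stop-the-world Full GCs at all?
  grep -c 'Full GC' $HBASE_HOME/logs/gc-hbase.log

  # print any collection whose wall-clock pause exceeded one second
  awk -F'real=' '/real=/ { split($2, a, " "); if (a[1] + 0 > 1.0) print }' \
    $HBASE_HOME/logs/gc-hbase.log

If both turn up nothing, as in the log above, the 100% CPU is not a stop-the-world pause, which points back at the concurrent/parallel GC threads themselves.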
