On 29/05/2012 20:50, Kevin wrote: > > > On 5/29/12 11:09 AM, "Christopher Schultz" <ch...@christopherschultz.net> > wrote: > > Kevin, > > (FYI these are always a PITA to diagnose...) > > On 5/29/12 11:03 AM, Kevin wrote: >>>> We have a cluster of tomcat servers being used on a very high >>>> volume website. We've noticed that for the first 5-6 hours after an >>>> application re-load that Full GC will run every 2 minutes pausing >>>> the application for anywhere between 5 and 20 seconds. After 5-6 >>>> hours the full GC will no longer run until tomcat is reloaded. >>>> This behavior is repeatable. >>>> >>>> We discounted our EhCache settings as potential issue because >>>> tomcat can be killed without this issue showing up. It's not until >>>> we actually reload the WAR with changes (clearing the work >>>> directory) and restart tomcat that this issue shows up. Traffic >>>> level isn't a factor we'll go through our peak hours with no >>>> problem. > > All good things to know. > >>>> The servers are all dual quad core with 32GB of RAM running Centos >>>> 5 and tomcat 6.0.18. We've attempted every suggested GC setting >>>> change suggested on the net and by developers and system >>>> administrators. > > Did you try looking at what the cause might be, or just blindly apply > suggestions to see if something would stick? > >> Initially we looked closely at it trying to solve the problem as time wore >> on we tried anything. This has been going on for months now. > > > >>>> Our java opts are being played with on a daily basis however here >>>> are the settings that correspond to the sample GC log below: >>>> >>>> >>>> -server -Xmx27g -Xms27g -XX:+DisableExplicitGC >>>> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution >>>> -Dsun.rmi.dgc.client.gcInterval=900000 >>>> -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g >>>> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps >>>> -XX:+PrintGCDetails > > A new size of 8GiB seems high to me, but I've never run with a total > heap size of anything nearly as large as 27GiB, so maybe that's a > reasonable size. My opinion here is probably irrelevant. > >>>> 215.501: [GC 215.504: [ParNew Desired survivor size 238583808 >>>> bytes, new threshold 15 (max 15) - age 1: 50457968 bytes, >>>> 50457968 total : 7456799K->111048K(7922624K), 0.0617110 secs] >>>> 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68 >>>> sys=0.00, real=0.07 secs] > > Odd that user+sys < real. Hmm. > > [...] > >>>> 316.456: [GC 316.459: [ParNew Desired survivor size 238583808 >>>> bytes, new threshold 15 (max 15) - age 1: 41430416 bytes, >>>> 41430416 total - age 3: 22728376 bytes, 64158792 total - age >>>> 5: 19599960 bytes, 83758752 total - age 6: 21847616 bytes, >>>> 105606368 total - age 7: 27667592 bytes, 133273960 total - age >>>> 8: 10904 bytes, 133284864 total - age 9: 31824256 bytes, >>>> 165109120 total : 7650333K->215213K(7922624K), 0.1332630 secs] >>>> 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45 >>>> sys=0.01, real=0.14 secs] > > 1.45 seconds for a full-GC isn't anywhere near the 5s-20s claimed > above. Have you been able to actually instrument a 5s-20s GC-pause? > Or, are you saying that your webapp appears to stall for that long and > those observances coincide with full-GC runs? > >>>> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), >>>> 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm : >>>> 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87 >>>> sys=0.02, real=3.41 secs] > > Also relatively short. > >> They fluctuate. > > >>>> **Last Full GC** >>>> >>>> 20517.892: [GC 20517.898: [ParNew Desired survivor size 238583808 >>>> bytes, new threshold 15 (max 15) - age 1: 33948208 bytes, >>>> 33948208 total - age 2: 88280 bytes, 34036488 total - age >>>> 3: 19872472 bytes, 53908960 total - age 4: 16072608 bytes, >>>> 69981568 total - age 5: 15718712 bytes, 85700280 total - age >>>> 6: 15771016 bytes, 101471296 total - age 7: 16895976 bytes, >>>> 118367272 total - age 8: 24233728 bytes, 142601000 total : >>>> 7618727K->200950K(7922624K), 0.1728420 secs] >>>> 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21 >>>> sys=0.01, real=0.18 secs] > > Note that the above is not a "Full GC". This one /is/: > >>>> 20526.469: [Full GC 20526.475: [CMS: >>>> 9175755K->9210800K(19922944K), 33.1161300 secs] >>>> 13632232K->9210800K(27845568K), [CMS Perm : >>>> 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 >>>> sys=0.02, real=33.12 secs] > > Okay, that one's a doosey. > > It's weird that it looks like after the full GC, the heap has *lost* > some space ("CMS: 9175755K->9210800K"). Can you watch the logs and > take a heap dump (which may take a LONG time with such a big heap, so > be prepared for another long pause) after one of these full GCs? > > It would be better to get a heap dump both before AND after the full > GC, but that can ruin your performance if you enable those automated > GC options. > >> We can run this after our next code release in two days. > > >>>> Any help would be greatly appreciated. > > Are you populating huge caches on startup or anything like that? > Generational GC performance is related to the number of surviving > objects that have to be moved from one generation to the next, so if > you have a huge population of objects surviving each collection one > after the other (which is what I might expect if you had an enormous > cache being fulled all at once) then your performance will suck for a > while until those objects are tenured enough to avoid periodic > collections. > >> This was our initial assumption however since we're using an in memory >> EhCache we would expect this issue to re-accure when tomcat is restarted. >> It doesn't. It'll only show up if we stop tomcat, reload the WAR and >> clean the work directory.
Which version of EhCache? How large is the cache, are you using RMI replication or disk overflow? p > Are you able to inspect the various generations before and after these > pauses? I'd be interested to see if you have a huge population of > objects being tenured at once and what types of objects they are (they > might be something obvious like instances of our.big.fat.CacheEntry or > something like that). > >> We'll give it a try > >> Thanks > > > > -chris >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org > For additional commands, e-mail: users-h...@tomcat.apache.org > -- [key:62590808]
signature.asc
Description: OpenPGP digital signature