On 29/05/2012 20:50, Kevin wrote:
> 
> 
> On 5/29/12 11:09 AM, "Christopher Schultz" <ch...@christopherschultz.net>
> wrote:
> 
> Kevin,
> 
> (FYI these are always a PITA to diagnose...)
> 
> On 5/29/12 11:03 AM, Kevin wrote:
>>>> We have a cluster of tomcat servers being used on a very high
>>>> volume website. We've noticed that for the first 5-6 hours after an
>>>> application re-load, Full GC will run every 2 minutes, pausing
>>>> the application for anywhere between 5 and 20 seconds. After 5-6
>>>> hours the full GC will no longer run until tomcat is reloaded.
>>>> This behavior is repeatable.
>>>>
>>>> We discounted our EhCache settings as a potential issue because
>>>> tomcat can be killed without this issue showing up.  It's not until
>>>> we actually reload the WAR with changes (clearing the work
>>>> directory) and restart tomcat that this issue shows up.  Traffic
>>>> level isn't a factor; we go through our peak hours with no
>>>> problem.
> 
> All good things to know.
> 
>>>> The servers are all dual quad core with 32GB of RAM running Centos
>>>> 5 and tomcat 6.0.18.  We've attempted every GC setting change
>>>> suggested on the net and by developers and system administrators.
> 
> Did you try looking at what the cause might be, or just blindly apply
> suggestions to see if something would stick?
> 
>> Initially we looked at it closely, trying to solve the problem; as time
>> wore on we tried anything.  This has been going on for months now.
> 
> 
> 
>>>> Our java opts are being played with on a daily basis; however, here
>>>> are the settings that correspond to the sample GC log below:
>>>>
>>>>
>>>> -server -Xmx27g -Xms27g  -XX:+DisableExplicitGC
>>>> -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution
>>>> -Dsun.rmi.dgc.client.gcInterval=900000
>>>> -Dsun.rmi.dgc.server.gcInterval=900000 -XX:NewSize=8g
>>>> -XX:SurvivorRatio=16 -verbose:gc -XX:+PrintGCTimeStamps
>>>> -XX:+PrintGCDetails
> 
> A new size of 8GiB seems high to me, but I've never run with a total
> heap size of anything nearly as large as 27GiB, so maybe that's a
> reasonable size. My opinion here is probably irrelevant.
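> 
> For what it's worth, the survivor sizing in the log below is at least
> consistent with those flags (back-of-the-envelope arithmetic, not a
> judgement on whether the sizes are right): with -XX:NewSize=8g and
> -XX:SurvivorRatio=16, each survivor space is roughly
> 8 GiB / (16 + 2) = ~455 MiB, and the default TargetSurvivorRatio of 50%
> works out to the "Desired survivor size 238583808 bytes" (~227 MiB)
> that shows up in your ParNew lines. So the young gen is laid out the
> way you intended.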
> 
>>>> 215.501: [GC 215.504: [ParNew
>>>> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
>>>> - age   1:   50457968 bytes,   50457968 total
>>>> : 7456799K->111048K(7922624K), 0.0617110 secs] 8614906K->1269155K(27845568K), 0.0661400 secs] [Times: user=0.68 sys=0.00, real=0.07 secs]
> 
> The user/real ratio there (0.68s vs. 0.07s) looks odd at first, but
> it's mostly just the parallel GC threads adding up. Hmm.
> 
> [...]
> 
>>>> 316.456: [GC 316.459: [ParNew
>>>> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
>>>> - age   1:   41430416 bytes,   41430416 total
>>>> - age   3:   22728376 bytes,   64158792 total
>>>> - age   5:   19599960 bytes,   83758752 total
>>>> - age   6:   21847616 bytes,  105606368 total
>>>> - age   7:   27667592 bytes,  133273960 total
>>>> - age   8:      10904 bytes,  133284864 total
>>>> - age   9:   31824256 bytes,  165109120 total
>>>> : 7650333K->215213K(7922624K), 0.1332630 secs] 8808440K->1373320K(27845568K), 0.1380590 secs] [Times: user=1.45 sys=0.01, real=0.14 secs]
> 
> A 0.14-second pause (1.45s of user time spread across the parallel
> GC threads) for that minor collection isn't anywhere near the 5s-20s
> claimed above. Have you been able to actually instrument a 5s-20s
> GC pause? Or are you saying that your webapp appears to stall for that
> long and those observations coincide with full-GC runs?
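> 
> If you haven't already, something along these lines (just a sketch;
> the log path is only an example) will tell you how long the application
> threads were actually stopped, regardless of what kind of collection
> caused it:
> 
>   -Xloggc:/path/to/gc.log
>   -XX:+PrintGCDateStamps
>   -XX:+PrintGCApplicationStoppedTime
> 
> Watching "jstat -gcutil <pid> 1000" during one of the stalls can also
> show which generation is actually churning.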
> 
>>>> 343.376: [Full GC 343.378: [CMS: 1158107K->1312570K(19922944K), 3.4129290 secs] 2884580K->1312570K(27845568K), [CMS Perm : 83964K->47203K(83968K)], 3.4168600 secs] [Times: user=3.87 sys=0.02, real=3.41 secs]
> 
> Also relatively short.
> 
>> They fluctuate. 
> 
> 
>>>> **Last Full GC**
>>>>
>>>> 20517.892: [GC 20517.898: [ParNew
>>>> Desired survivor size 238583808 bytes, new threshold 15 (max 15)
>>>> - age   1:   33948208 bytes,   33948208 total
>>>> - age   2:      88280 bytes,   34036488 total
>>>> - age   3:   19872472 bytes,   53908960 total
>>>> - age   4:   16072608 bytes,   69981568 total
>>>> - age   5:   15718712 bytes,   85700280 total
>>>> - age   6:   15771016 bytes,  101471296 total
>>>> - age   7:   16895976 bytes,  118367272 total
>>>> - age   8:   24233728 bytes,  142601000 total
>>>> : 7618727K->200950K(7922624K), 0.1728420 secs] 16794482K->9376705K(27845568K), 0.1822350 secs] [Times: user=2.21 sys=0.01, real=0.18 secs]
> 
> Note that the above is not a "Full GC". This one /is/:
> 
>>>> 20526.469: [Full GC 20526.475: [CMS: 9175755K->9210800K(19922944K), 33.1161300 secs] 13632232K->9210800K(27845568K), [CMS Perm : 83967K->53332K(83968K)], 33.1254170 secs] [Times: user=33.12 sys=0.02, real=33.12 secs]
> 
> Okay, that one's a doozy.
> 
> It's weird that after the full GC the old generation is actually
> *fuller* than before ("CMS: 9175755K->9210800K"), i.e. the collection
> reclaimed essentially nothing. Can you watch the logs and take a heap
> dump (which may take a LONG time with such a big heap, so be prepared
> for another long pause) after one of these full GCs?
> 
> It would be better to get a heap dump both before AND after the full
> GC, but that can ruin your performance if you enable those automated
> GC options.
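> 
> Concretely, what I have in mind is something like a manual
> 
>   jmap -dump:format=b,file=/tmp/after-full-gc.hprof <tomcat-pid>
> 
> right after one of those full GCs (the file path and name are just
> placeholders), or the automated -XX:+HeapDumpBeforeFullGC /
> -XX:+HeapDumpAfterFullGC flags if you can tolerate the extra pauses
> on a heap that size.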
> 
>> We can run this after our next code release in two days.
> 
> 
>>>> Any help would be greatly appreciated.
> 
> Are you populating huge caches on startup or anything like that?
> Generational GC performance is related to the number of surviving
> objects that have to be moved from one generation to the next, so if
> you have a huge population of objects surviving each collection one
> after the other (which is what I might expect if you had an enormous
> cache being filled all at once), then your performance will suck for a
> while until those objects are tenured enough to avoid periodic
> collections.
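> 
> If the age histograms do show a big wave of objects shuffling between
> the survivor spaces, one knob to experiment with is
> -XX:MaxTenuringThreshold (e.g. =4 instead of the 15 in your logs), so
> that long-lived stuff gets promoted sooner instead of being copied on
> every young collection. Treat that value as a placeholder; the right
> number has to come from the age distribution you're already printing.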
> 
>> This was our initial assumption; however, since we're using an in-memory
>> EhCache we would expect this issue to recur when tomcat is restarted.
>> It doesn't.  It'll only show up if we stop tomcat, reload the WAR and
>> clean the work directory.

Which version of EhCache?  How large is the cache, and are you using RMI
replication or disk overflow?
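
By "disk overflow / RMI replication" I mean the bits of ehcache.xml that
look roughly like this (the cache name and attribute values below are
just placeholders, adjust to whatever you actually run):

  <cache name="someCache"
         maxElementsInMemory="100000"
         overflowToDisk="true"
         eternal="false"
         timeToLiveSeconds="600">
    <cacheEventListenerFactory
        class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"/>
  </cache>

Whether those are in play changes where the memory pressure is likely
coming from.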


p


> Are you able to inspect the various generations before and after these
> pauses? I'd be interested to see if you have a huge population of
> objects being tenured at once and what types of objects they are (they
> might be something obvious like instances of our.big.fat.CacheEntry or
> something like that).
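> 
> Short of a full dump, a class histogram is a cheap first look, e.g.
> 
>   jmap -histo <tomcat-pid> | head -30
> 
> before and after one of the pauses (adding ":live" forces a collection
> first, so leave it off if you don't want to trigger yet another one).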
> 
>> We'll give it a try
> 
>> Thanks
> 
> 
> 
> -chris
