On Sep 5, 10:43 pm, Mark Mandel <[email protected]> wrote:
> I've been working through a leak with Brian G, but have been unable to
> reproduce it locally, so I have no idea what has happened there. He seems
> to have conquered it by using Java6u10.
So I've been working through a memory issue exactly like the one the original poster describes since around March. I have a two-node cluster using Transfersync. One of the two nodes would climb in memory usage more quickly, to the point where within 48 hours, MAYBE 72, we would be forced to restart the server. The other node would suffer the same fate, but it would take a little longer. It didn't matter whether basic server monitoring was on or off.

After extensive debugging, including taking heap dumps and working with Mark, we identified that the reapMap, which is part of the caching framework, had a buttload of references to objects, and those references were preventing the objects from being GC'd properly. Mark hasn't been able to reproduce it, but clearly there are some scenarios under which this happens. I can reproduce it consistently, and over the last 6 months we have had to restart our app at least every 48 hours. Our production app runs on CentOS 5 Linux using a CF8.0.1 multi-server install.

One of the symptoms was far more full GCs and far fewer young GCs on the "sick" node, which led to dramatically longer full GCs (more work to do; we're talking full freezes of 10 seconds or more). This is what led to requests queuing up and timing out. In our case, even when the heap eventually hit 100%, the server didn't stop functioning. It just ran so slowly that no requests would finish within our timeout.

Last week I was grasping at straws and started looking at other possibilities. We started noticing our issues in March, but we have seasonal traffic, so I thought increased traffic was simply making the issue visible. Then I noticed that it was around early March that we upgraded from 1.6.0 u10, which was a HUGE performance boost over the stock JVM, to 1.6.0 u12. We subsequently went to 1.6.0 u14 as well. I rolled back to 1.6.0 u10 last Tuesday night, 7.5 days ago, and today my app is still up and running without a restart!
We're currently running with about 30% free memory on our 2GB max. I don't believe the problem is actually solved, but switching back to u10 (and subsequently installing CHF3 on top of our previous CHF2 for 8.0.1) has radically improved our situation. I was actually able to leave town this weekend and not freak out. Looking at the jstat details, things look much more consistent between the two nodes and healthier in general. Here's a little output from each node, taken just now after 7.5 days up:

[r...@web4 brian]# /usr/java/latest/bin/jstat -gcutil 7846
  S0     S1     E      O      P      YGC    YGCT     FGC   FGCT     GCT
  0.00  33.70  94.65  74.50  91.78  25665  574.200   181  704.372  1278.572

[r...@web5 brian]# /usr/java/latest/bin/jstat -gcutil 26432
  S0     S1     E      O      P      YGC    YGCT     FGC   FGCT     GCT
  0.00  15.84  52.48  66.85  92.96  19735  499.723   182  762.452  1262.174

Old Generation, the "O" column (shown as a percentage used), is essentially the space where long-lived objects end up. When it gets to 100%, you're out of space. The garbage collector goes through and tries to clean up objects that are no longer referenced. But anything in the application scope or session scope, for example, will live on forever if the references aren't released.

The issue in Transfer is that the reapMap, which is part of Transfer (which generally lives in the application scope), is holding onto references that should be popped into a ReferenceQueue for the GC to clean up. Since the references are still there, the GC won't kill the objects. Eventually this fills up the available heap and you end up with an unresponsive server.

I also suspect that the SoftReferenceHandler thread, as well as full GCs, take a long-ass time BECAUSE there are so many objects in the queue waiting to be cleared that never are. That's just speculation on my part, but it matches my experience, which is that Transfer performance degrades the longer the system is up (this was when I was running on u12 and u14; with u10 I have no results yet, but I don't seem to be experiencing it).
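For anyone who wants to picture the reapMap mechanism described above, here is a minimal Java sketch of a soft-reference cache that keeps a bookkeeping map and drains a ReferenceQueue. This is NOT Transfer's actual code: the class name `SoftCacheSketch`, its methods, and the idea that the bookkeeping map is keyed by the reference object are all assumptions made purely for illustration. `simulateCollection` stands in for the garbage collector by calling `Reference.enqueue()` directly, so the behaviour is repeatable.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical soft-reference cache; a sketch only, not Transfer's implementation. */
final class SoftCacheSketch<K, V> {
    // Queue that receives references once their referents have been cleared.
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();
    // Bookkeeping map (loosely analogous to a "reapMap"): reference -> cache key.
    private final Map<Reference<? extends V>, K> reapMap = new ConcurrentHashMap<>();
    private final Map<K, SoftReference<V>> cache = new ConcurrentHashMap<>();

    void put(K key, V value) {
        SoftReference<V> ref = new SoftReference<>(value, queue);
        SoftReference<V> old = cache.put(key, ref);
        if (old != null) {
            reapMap.remove(old); // drop bookkeeping for the replaced entry
        }
        reapMap.put(ref, key);
    }

    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }

    /** Test stand-in for the GC: enqueue the reference for this key by hand. */
    boolean simulateCollection(K key) {
        SoftReference<V> ref = cache.get(key);
        return ref != null && ref.enqueue();
    }

    /** Drain the queue and remove bookkeeping for every enqueued reference. */
    int reap() {
        int reaped = 0;
        Reference<? extends V> ref;
        while ((ref = queue.poll()) != null) {
            K key = reapMap.remove(ref);
            if (key != null) {
                cache.remove(key, ref); // only remove if the key still maps to this reference
                reaped++;
            }
        }
        return reaped;
    }

    int reapMapSize() {
        return reapMap.size();
    }

    public static void main(String[] args) {
        SoftCacheSketch<String, byte[]> c = new SoftCacheSketch<>();
        for (int i = 0; i < 1000; i++) c.put("k" + i, new byte[16]);
        for (int i = 0; i < 900; i++) c.simulateCollection("k" + i);
        System.out.println("reapMap before reap: " + c.reapMapSize());
        System.out.println("reaped: " + c.reap());
        System.out.println("reapMap after reap: " + c.reapMapSize());
    }
}
```

The point of the sketch is that the bookkeeping map only shrinks when `reap()` actually sees references come out of the queue. If references never arrive in the queue, or nothing drains it, the map keeps growing with uptime. Whether that is exactly what happens inside Transfer on u12 and u14 is an assumption, not something the heap dumps have confirmed.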
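To put the jstat rows above in perspective, it helps to divide the cumulative GC time by the number of collections. Here is a small Java sketch that does that for one `jstat -gcutil` data row. The class name `JstatGcutilSketch` and the `averages` helper are hypothetical, and the sketch assumes the ten-column layout shown in the output above (S0 S1 E O P YGC YGCT FGC FGCT GCT), with YGCT and FGCT in seconds.

```java
import java.util.Locale;

/** Hypothetical helper: average GC pause per collection from one jstat -gcutil row. */
final class JstatGcutilSketch {

    /**
     * Returns { average young GC seconds, average full GC seconds }.
     * Assumes the column order S0 S1 E O P YGC YGCT FGC FGCT GCT.
     */
    static double[] averages(String row) {
        String[] f = row.trim().split("\\s+");
        if (f.length != 10) {
            throw new IllegalArgumentException("expected 10 columns, got " + f.length);
        }
        long ygc = Long.parseLong(f[5]);        // number of young collections
        double ygct = Double.parseDouble(f[6]); // total young GC time
        long fgc = Long.parseLong(f[7]);        // number of full collections
        double fgct = Double.parseDouble(f[8]); // total full GC time
        double avgYoung = ygc == 0 ? 0.0 : ygct / ygc;
        double avgFull = fgc == 0 ? 0.0 : fgct / fgc;
        return new double[] { avgYoung, avgFull };
    }

    public static void main(String[] args) {
        // Data rows copied from the web4 and web5 output in this post.
        String web4 = "0.00 33.70 94.65 74.50 91.78 25665 574.200 181 704.372 1278.572";
        String web5 = "0.00 15.84 52.48 66.85 92.96 19735 499.723 182 762.452 1262.174";
        for (String row : new String[] { web4, web5 }) {
            double[] a = averages(row);
            System.out.printf(Locale.ROOT,
                    "avg young GC %.3f s, avg full GC %.2f s%n", a[0], a[1]);
        }
    }
}
```

On these numbers the average full GC works out to roughly 3.9 seconds on web4 and 4.2 seconds on web5, against a couple of hundredths of a second for a young GC. These are averages over the whole 7.5 days, so they say nothing about how long the worst individual pause was.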
For sure, when your Old Generation gets close to full, you wind up thrashing around with GCs in order to free up memory. It's like if your computer has 2GB of RAM, you're using 1.95GB of it, and you want to load a really big file into memory.

The good news in the above two jstat outputs is that the two nodes are more or less consistent. You'll notice that web4 has quite a few more YGCs. That's because more traffic is currently routed to web4, which was a feeble attempt to keep web5 up longer before I switched JVMs. Web5 correspondingly has fewer YGCs but practically the same number of FGCs (182 vs. 181), which is what I would expect. Less transient traffic (fewer sessions) means fewer objects to reap in young GCs.

The good news is that things are running MUCH better. The bad news is that I'm pretty sure there is still an issue. I took a heap dump after a couple of days of uptime and still found about 6k references in the reapMap. I think there should be on the order of hundreds, and the number should go up and down as the cache clears out expired objects. Previously I had ~61k objects in the reapMap when the heap was full. Now that things have improved, it may be worthwhile to take a heap dump today and see whether there are any recurring patterns in which objects get stuck in the reapMap and which don't.

We're also looking at migrating to EC2, so it's encouraging to hear that the problem might just disappear when we switch environments. I suppose I shouldn't tar up my install and move it over then? :) I'm also wondering if the problem will just go away with CF9 too.

Hope that detail helps... you are definitely not alone. Try u10 and see if it makes a difference?

Brian

--~--~---------~--~----~------------~-------~--~----~
Before posting questions to the group please read:
http://groups.google.com/group/transfer-dev/web/how-to-ask-support-questions-on-transfer
You received this message because you are subscribed to the Google Groups "transfer-dev" group.
