On Sep 5, 10:43 pm, Mark Mandel <[email protected]> wrote:
> I've been working through a leak with Brian G, but have been unable to
> reproduce it locally, so I have no idea what has happened there.  He seems
> to have conquered it by using Java6u10.

I've been fighting a memory issue exactly like the one the original
poster describes since around March.

I have a two-node cluster using Transfersync.  One of the two nodes
would climb in memory usage faster than the other, and within 48
hours, MAYBE 72, we would be forced to restart the server.  The other
node would suffer the same fate, just a little later.  It didn't
matter whether basic server monitoring was on or off.

After extensive debugging, including taking heap dumps and working with
Mark, we identified that the reapMap, which is part of the caching
framework, was holding a buttload of references that kept the objects
they pointed to from being GC'd.  Mark hasn't been able to reproduce
it, but clearly there are some scenarios under which this happens.  I
can reproduce it consistently, and over the last 6 months we have had
to restart our app at least every 48 hours.  Our production app runs
on CentOS 5 Linux using a CF8.0.1 multi-server install.
One of the symptoms was far more full GCs and far fewer young GCs on
the "sick" node, along with dramatically longer full GCs (more work
to do; we're talking full freezes of 10 seconds or more).  This is
what led to requests queuing up and timing out.  In our case, even
when the heap eventually hit 100%, the server didn't stop functioning;
it just ran so slowly that no request would finish within our
timeout.

Last week, I was grasping at straws and started looking at other
possibilities.  We started noticing our issues in March but we have
seasonal traffic so I thought that it was just increased traffic that
was causing the issue to be visible.  Then I noticed that it was
around early March that we upgraded from 1.6.0 u10, which was a HUGE
performance boost over the stock JVM, to 1.6.0 u12.  We subsequently
went to 1.6.0 u14 as well.

I rolled back to 1.6.0 u10 last Tuesday night, 7.5 days ago, and today
my app is still up and running without a restart!

We're currently running with about 30% free memory on our 2GB max.  I
don't believe the problem is actually solved but switching back to u10
(and subsequently installing CHF3 on top of our previous CHF2 for
8.0.1) has radically improved our situation.  I was actually able to
leave town this weekend and not freak out.

Looking at the jstat details, things are looking much more consistent
between the two nodes and healthier in general.  Here's a little
output from each node taken just now after 7.5 days up:

[r...@web4 brian]# /usr/java/latest/bin/jstat -gcutil 7846
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  33.70  94.65  74.50  91.78  25665  574.200   181  704.372 1278.572

[r...@web5 brian]# /usr/java/latest/bin/jstat -gcutil 26432
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  15.84  52.48  66.85  92.96  19735  499.723   182  762.452 1262.174

The Old Generation, the "O" column, is essentially where long-lived
objects end up.  When it gets to 100%, you're out of space.  The
garbage collector goes through and tries to clean up objects that are
no longer referenced.  But anything held in the application scope or
session scope, for example, will live on indefinitely if the
references aren't released.
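
If you'd rather watch this from inside the JVM than through jstat, here
is a rough sketch using the standard java.lang.management API.  The
class name HeapPoolUsage is made up for illustration, the pool names
you get back depend on which collector is running, and the percentage
is computed against the pool's max where one is defined (falling back
to committed size), so it won't necessarily match the jstat column
digit for digit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class HeapPoolUsage {

    // Percent used for every heap pool (eden, survivor, old/tenured, ...).
    // Uses the pool's max when defined, otherwise its committed size.
    static Map<String, Double> percentUsedByHeapPool() {
        Map<String, Double> result = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip perm gen / metaspace / code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            long limit = usage.getMax() > 0 ? usage.getMax() : usage.getCommitted();
            double percent = limit > 0 ? 100.0 * usage.getUsed() / limit : 0.0;
            result.put(pool.getName(), percent);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Double> e : percentUsedByHeapPool().entrySet()) {
            System.out.printf("%-25s %6.2f%%%n", e.getKey(), e.getValue());
        }
    }
}
```

The pool to watch is whichever one your collector reports as the old
or tenured generation.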

The issue in Transfer is that the reapMap, which is part of Transfer
and generally lives in the Application scope, is holding onto
references that should have landed in a ReferenceQueue so they could
be cleaned up.  Since the references are still there, the GC won't
reclaim the objects.  Eventually this fills up the available heap and
you end up with an unresponsive server.  I also suspect that the
SoftReferenceHandler thread and the full GCs take a long ass time
BECAUSE there are so many objects waiting to be cleared that never
are.  That's just speculation on my part, but it matches my
experience, which is that Transfer performance degrades the longer
the system is up (this was on u12 and u14; on u10 I have no results
yet but don't seem to be experiencing it).  For sure, when your Old
generation gets close to full, you wind up thrashing around with GCs
trying to free up memory.  It's like having a computer with 2GB of
RAM, using 1.95GB of it, and then trying to load a really big file
into memory.
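
To make the mechanism concrete, here is a minimal Java sketch of the
general pattern: a soft-reference cache that tracks its references in
a reap map and drains a ReferenceQueue.  This is NOT Transfer's actual
code; the class SoftCacheSketch and all of its methods are
hypothetical.  The point is only that bookkeeping entries leave the
map when reap() finds their reference on the queue, so if references
never show up there, the map just keeps growing.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pattern, not Transfer's implementation.
class SoftCacheSketch<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();
    // Reverse lookup from a reference back to its cache key.
    private final Map<Reference<? extends V>, K> reapMap = new HashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    synchronized SoftReference<V> put(K key, V value) {
        SoftReference<V> ref = new SoftReference<>(value, queue);
        SoftReference<V> old = cache.put(key, ref);
        if (old != null) {
            reapMap.remove(old); // drop bookkeeping for the replaced entry
        }
        reapMap.put(ref, key);
        return ref;
    }

    synchronized V get(K key) {
        SoftReference<V> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }

    // Drain references that have been enqueued and drop their bookkeeping.
    // Returns the number of reapMap entries removed.
    synchronized int reap() {
        int reaped = 0;
        Reference<? extends V> ref;
        while ((ref = queue.poll()) != null) {
            K key = reapMap.remove(ref);
            if (key != null) {
                cache.remove(key, ref); // only if the key still maps to this ref
                reaped++;
            }
        }
        return reaped;
    }

    synchronized int reapMapSize() {
        return reapMap.size();
    }

    public static void main(String[] args) {
        SoftCacheSketch<String, byte[]> c = new SoftCacheSketch<>();
        SoftReference<byte[]> a = c.put("a", new byte[16]);
        c.put("b", new byte[16]);
        a.enqueue(); // stand-in for the GC clearing and enqueuing the reference
        System.out.println("reaped=" + c.reap() + " reapMap=" + c.reapMapSize());
    }
}
```

In a real cache the GC does the enqueuing once it clears a soft
reference; the manual enqueue() call in main is only there so the
example behaves the same on every run.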

The encouraging part of the two jstat outputs above is that the two
nodes are more or less consistent.  You'll notice that web4 has quite
a few more YGCs; that's because more traffic is currently routed to
web4, which was a feeble attempt to keep web5 up longer before I
switched JVMs.  Web5 correspondingly has fewer YGCs but about the same
number of FGCs, which is what I would expect.  Less transient traffic
(fewer sessions) would mean fewer objects to reap in young GCs.
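
The per-collection cost can be worked out from the two jstat samples
above by dividing accumulated time by the number of collections.  A
small sketch in Java, assuming the ten-column -gcutil layout shown
above (the class JstatGcutilRow and its methods are made-up names for
illustration):

```java
import java.util.Locale;

// Hypothetical helper: parses one data row of `jstat -gcutil` output
// (S0 S1 E O P YGC YGCT FGC FGCT GCT) and derives average pause times.
class JstatGcutilRow {
    final double oldPercent;   // O: old generation utilization, percent
    final long youngCount;     // YGC: number of young collections
    final double youngSeconds; // YGCT: total young collection time, seconds
    final long fullCount;      // FGC: number of full collections
    final double fullSeconds;  // FGCT: total full collection time, seconds

    JstatGcutilRow(String row) {
        String[] f = row.trim().split("\\s+");
        if (f.length != 10) {
            throw new IllegalArgumentException("expected 10 columns, got " + f.length);
        }
        oldPercent = Double.parseDouble(f[3]);
        youngCount = Long.parseLong(f[5]);
        youngSeconds = Double.parseDouble(f[6]);
        fullCount = Long.parseLong(f[7]);
        fullSeconds = Double.parseDouble(f[8]);
    }

    double avgYoungPauseSeconds() {
        return youngCount == 0 ? 0.0 : youngSeconds / youngCount;
    }

    double avgFullPauseSeconds() {
        return fullCount == 0 ? 0.0 : fullSeconds / fullCount;
    }

    public static void main(String[] args) {
        String web4 = "0.00  33.70  94.65  74.50  91.78  25665  574.200   181  704.372 1278.572";
        String web5 = "0.00  15.84  52.48  66.85  92.96  19735  499.723   182  762.452 1262.174";
        for (String row : new String[] {web4, web5}) {
            JstatGcutilRow r = new JstatGcutilRow(row);
            System.out.printf(Locale.ROOT, "old=%.2f%% avgYoung=%.3fs avgFull=%.2fs%n",
                    r.oldPercent, r.avgYoungPauseSeconds(), r.avgFullPauseSeconds());
        }
    }
}
```

On the samples above that comes to roughly 3.9 seconds per full GC on
web4 and 4.2 seconds on web5, against about 22 ms and 25 ms per young
GC.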

The good news is things are running MUCH better.

The bad news is that I'm pretty sure there is still an issue.  I took
a heap dump after a couple days of uptime and I still found about 6k
references in the reapMap.  I think there should be on the order of
hundreds and the number should go up and down as the cache clears out
expired objects.  Previously I had ~61k objects in the reapMap when
the heap was full.

Now that things have improved, it may be worthwhile to take a heap dump
today and see if there are any recurring patterns in which objects
are stuck in the reapMap and which aren't.  We're also looking at
migrating to EC2 so it's encouraging to hear that the problem might
just disappear when we switch environments.  I suppose I shouldn't tar
up my install and move it over then? :)  I'm also wondering if the
problem will just go away with CF9 too.

Hope that detail helps... you are definitely not alone.  Try u10 and
see if it makes a difference?


Brian
