We have two slaves replicating off one master every 2 minutes.
Both use the CMS + ParNew garbage collectors, specifically
-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
but periodically they both get into a GC storm and just keel over.
Looking through the GC logs, the amount of memory reclaimed in each GC
run gets less and less until we get a concurrent mode failure and then
Solr effectively dies.
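
To make that trend concrete, here is a rough sketch of how the reclaimed amount per collection could be pulled out of a log. It assumes entries carry the usual HotSpot "before->after(total)" heap figures in K; the function names and the sample lines are made up for illustration and are not taken from our real logs.

```python
import re

# Matches the "before->after(total)" heap figures in a HotSpot GC log entry,
# e.g. "[GC 524288K->393216K(1048576K), 0.05 secs]".
HEAP_CHANGE = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def reclaimed_per_gc(lines):
    """Return the KB reclaimed by each logged collection, in log order."""
    reclaimed = []
    for line in lines:
        matches = HEAP_CHANGE.findall(line)
        if not matches:
            continue  # not a GC entry we recognise
        # A detailed entry may hold several figures (generation first,
        # whole heap last); use the last one on the line.
        before, after, _total = map(int, matches[-1])
        reclaimed.append(before - after)
    return reclaimed


def is_shrinking(reclaimed):
    """True if each collection reclaimed no more than the previous one."""
    return all(b <= a for a, b in zip(reclaimed, reclaimed[1:]))


# Hypothetical sample lines, invented for this sketch.
sample = [
    "10.1: [GC 900000K->500000K(1048576K), 0.0500 secs]",
    "20.2: [GC 950000K->700000K(1048576K), 0.0600 secs]",
    "30.3: [GC 990000K->900000K(1048576K), 0.0700 secs]",
]

print(reclaimed_per_gc(sample))
print(is_shrinking(reclaimed_per_gc(sample)))
```

A steadily shrinking list like this, with the "after" figure creeping towards the total, is the pattern I mean.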
Is it possible there's a memory leak? I note that later versions of
Lucene have fixed a few leaks. Our current versions are relatively old:
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
so I'm wondering if upgrading to a later version of Lucene might help (of
course it might not but I'm trying to investigate all options at this
point). If so what's the best way to go about this? Can I just grab the
Lucene jars and drop them somewhere (or unpack and then repack the solr
war file?). Or should I use a nightly solr 1.4?
Or am I barking up completely the wrong tree? I'm trawling through heap
logs and GC logs at the moment trying to see what other tuning I can
do but any other hints, tips, tricks or cluebats gratefully received.
Even if it's just "Yeah, we had that problem and we added more slaves
and periodically restarted them."
thanks,
Simon