Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-04-05 Thread Jan Høydahl
Hi,

Just curious, was there any resolution to this?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8 Feb 2011, at 03:40, Markus Jelsma wrote:

 Do you have GC logging enabled? tail -f the log file and you'll see what CMS
 is telling you. Tuning the occupancy fraction of the tenured generation to a
 lower value than the default, and telling the JVM to only use your value to
 initiate a collection, can help a lot. The same goes for sizing the young
 generation and sometimes the survivor ratio.
 
 Consult the HotSpot CMS settings and young generation (or new) sizes. They
 are very important.
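 In practice that means flags along these lines (values and the log path
 are illustrative only; tune them against your own GC logs):
 
 -verbose:gc -Xloggc:/path/to/gc.log
 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
 -XX:CMSInitiatingOccupancyFraction=70
 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:NewSize=1G -XX:MaxNewSize=1G
 -XX:SurvivorRatio=6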
 
 If you have multiple slaves under the same load you can easily try different
 configurations. Keeping an eye on the nodes with a tool like JConsole, and at
 the same time tailing the GC log, will help a lot. Don't forget to send
 updates and frequent commits, or you won't be able to replay. I've never seen
 a Solr instance go down under heavy load without commits, but they tend to
 behave badly when commits occur under heavy load with long cache warming
 times (and heap consumption).
 
 You might also be suffering from memory fragmentation; this is bad and can
 lead to failure. You can configure the JVM to force a compaction before a GC.
 That's nice, but it does consume CPU time.
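 (For the record, on HotSpot JVMs of this era the compaction behaviour is
 controlled by flags along these lines - illustrative, not prescriptive:
 
 -XX:+UseCMSCompactAtFullCollection
 -XX:CMSFullGCsBeforeCompaction=0
 
 where the second flag sets how many full GCs may pass between compactions;
 0 compacts on every full collection.)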
 
 A query of death can, in theory, also happen when you sort on a very large
 dataset that isn't optimized; in that case the maxDoc value is too high.
 
 Anyway, try some settings and monitor the nodes and please report your 
 findings.
 
 On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
 Heap usage can spike after a commit. Existing caches are still in use and
 new caches are being generated and/or auto warmed. Can you confirm this
 is the case?
 
 We see spikes after replication which I suspect is, as you say, because
 of the ensuing commit.
 
 What we seem to have found is that when we weren't using the Concurrent
 GC, stop-the-world GC runs would kill the app. Now that we're using CMS
 we occasionally find ourselves in situations where the app still has
 memory left over but the load on the machine spikes, the GC duty cycle
 goes to 100% and the app never recovers.
 Restarting usually helps, but sometimes we have to take the machine out
 of the load balancer, wait for a number of minutes and then put it back
 in.
 
 We're working on two hypotheses:
 
 Firstly - we're CPU bound somehow and at some point we cross some
 threshold and GC or something else is just unable to keep up. So
 whilst it looks like instantaneous death of the app, it's actually
 gradual resource exhaustion where the definition of 'gradual' is 'a very
 short period of time' (as opposed to some cataclysmic infinite loop bug
 somewhere).
 
 Either that or ... Secondly - there's some sort of Query Of Death that
 kills machines. We just haven't found it yet, even when replaying logs.
 
 Or some combination of both. Or other things. It's maddeningly
 frustrating.
 
 We've also got to try deploying a custom solr.war and using the
 MMapDirectory to see if that helps with anything.



Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-04-05 Thread Simon Wistow
On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said:
 Just curious, was there any resolution to this?

Not really.

We tuned the GC pretty aggressively - we use these options

-server 
-Xmx20G -Xms20G -Xss10M
-XX:+UseConcMarkSweepGC 
-XX:+UseParNewGC 
-XX:+CMSIncrementalMode 
-XX:+CMSIncrementalPacing
-XX:SoftRefLRUPolicyMSPerMB=10

and we've played a little with CompressOops and AggressiveOpts.
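(The exact flag names, for anyone searching the archives:

-XX:+UseCompressedOops -XX:+AggressiveOpts

Both are standard HotSpot options.)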

We also backported the MMapDirectory factory to 1.4.1 and that helped a 
lot.
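For anyone on a stock build rather than a custom war: in later Solr 
versions the directory implementation is selectable in solrconfig.xml, 
roughly like this (class name is from Solr 3.x; our 1.4.1 backport used a 
custom factory):

<!-- use memory-mapped index files instead of java.io reads -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>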

We do still get spikes of long (5s-20s) queries a few times an hour 
which don't appear to be caused by any kind of Query of Death. 
Occasionally (once every few days) one of the slaves will experience a 
period of sustained slowness but recover by itself in less than a 
minute.

According to our GC logs we haven't had a full GC for a long time. 

Currently the state of play is that we commit on our master every 5000ms 
and the slaves replicate from the master every 2 minutes. Our response 
times for searches on the slaves are about 180-270ms, but if we turn off 
replication we get 60-90ms. So something is clearly up with that.
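For reference, the polling interval lives in the slave section of the 
replication handler in solrconfig.xml; a sketch, with a placeholder 
masterUrl:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- placeholder; point this at your master's replication handler -->
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <!-- hh:mm:ss between polls; we poll every 2 minutes -->
    <str name="pollInterval">00:02:00</str>
  </lst>
</requestHandler>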

Having talked to the good people at Lucid we're going to try playing 
around with commit intervals, upping our mergeFactor from 10 to 25 and 
maybe using the BalancedSegmentMergePolicy. 
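For reference, those knobs live in the index settings of solrconfig.xml, 
along these lines (exact mergePolicy syntax varies between Solr versions, 
and BalancedSegmentMergePolicy ships in Lucene's contrib/misc):

<!-- higher mergeFactor = fewer merges at index time, more segments to search -->
<mergeFactor>25</mergeFactor>
<mergePolicy class="org.apache.lucene.index.BalancedSegmentMergePolicy"/>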

The system seems to be stable at the moment which is good but obviously 
we'd like to lower our query times if possible.

Hopefully this might be of some use to somebody out there, sometime.

Simon




Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-02-07 Thread Markus Jelsma
Heap usage can spike after a commit. Existing caches are still in use and new 
caches are being generated and/or auto warmed. Can you confirm this is the 
case?
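(This is the usual consequence of autowarming: while a new searcher warms, 
the old searcher's caches stay live, so heap usage briefly doubles for 
warmed caches. The relevant settings are per cache in solrconfig.xml, 
e.g., with illustrative sizes:

<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="4096"/>

A large autowarmCount lengthens the window where both cache generations 
coexist.)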

On Friday 28 January 2011 00:34:42 Simon Wistow wrote:
 On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
  Are you sure you need CMS incremental mode? It's only advised when
  running on a machine with one or two processors. If you have more you
  should consider disabling the incremental flags.
 
 I'll test again, but we added those to get better performance - not much,
 but there did seem to be an improvement.
 
 The problem seems to be not in average use but that occasionally there's a
 huge spike in load (there doesn't seem to be a particular killer
 query) and Solr just never recovers.
 
 Thanks,
 
 Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-02-07 Thread Simon Wistow
On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
 Heap usage can spike after a commit. Existing caches are still in use and new 
 caches are being generated and/or auto warmed. Can you confirm this is the 
 case?

We see spikes after replication which I suspect is, as you say, because 
of the ensuing commit.

What we seem to have found is that when we weren't using the Concurrent 
GC, stop-the-world GC runs would kill the app. Now that we're using CMS 
we occasionally find ourselves in situations where the app still has 
memory left over but the load on the machine spikes, the GC duty cycle 
goes to 100% and the app never recovers.

Restarting usually helps, but sometimes we have to take the machine out 
of the load balancer, wait for a number of minutes and then put it back 
in.

We're working on two hypotheses:

Firstly - we're CPU bound somehow and at some point we cross some 
threshold and GC or something else is just unable to keep up. So 
whilst it looks like instantaneous death of the app, it's actually 
gradual resource exhaustion where the definition of 'gradual' is 'a very 
short period of time' (as opposed to some cataclysmic infinite loop bug 
somewhere).

Either that or ... Secondly - there's some sort of Query Of Death that 
kills machines. We just haven't found it yet, even when replaying logs. 

Or some combination of both. Or other things. It's maddeningly 
frustrating.

We've also got to try deploying a custom solr.war and using the 
MMapDirectory to see if that helps with anything.







Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-27 Thread Simon Wistow
On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
 Are you sure you need CMS incremental mode? It's only advised when running on
 a machine with one or two processors. If you have more you should consider
 disabling the incremental flags.

I'll test again, but we added those to get better performance - not much, 
but there did seem to be an improvement.

The problem seems to be not in average use but that occasionally there's a 
huge spike in load (there doesn't seem to be a particular killer 
query) and Solr just never recovers.

Thanks,

Simon




Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-25 Thread Markus Jelsma
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on 
a machine with one or two processors. If you have more you should consider 
disabling the incremental flags.

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
 We have two slaves replicating off one master every 2 minutes.
 
 Both using the CMS + ParNew Garbage collector. Specifically
 
 -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
 -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
 
 but periodically they both get into a GC storm and just keel over.
 
 Looking through the GC logs the amount of memory reclaimed in each GC
 run gets less and less until we get a concurrent mode failure and then
 Solr effectively dies.
 
 Is it possible there's a memory leak? I note that later versions of
 Lucene have fixed a few leaks. Our current versions are relatively old
 
   Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
 18:06:42
 
   Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
 
 so I'm wondering if upgrading to a later version of Lucene might help (of
 course it might not, but I'm trying to investigate all options at this
 point). If so, what's the best way to go about this? Can I just grab the
 Lucene jars and drop them somewhere (or unpack and then repack the solr
 war file)? Or should I use a nightly solr 1.4?
 
 Or am I barking up completely the wrong tree? I'm trawling through heap
 logs and GC logs at the moment trying to see what other tuning I can
 do, but any other hints, tips, tricks or cluebats gratefully received.
 Even if it's just "Yeah, we had that problem and we added more slaves
 and periodically restarted them".
 
 thanks,
 
 Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Simon Wistow
We have two slaves replicating off one master every 2 minutes.

Both using the CMS + ParNew Garbage collector. Specifically

-server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

but periodically they both get into a GC storm and just keel over.

Looking through the GC logs the amount of memory reclaimed in each GC 
run gets less and less until we get a concurrent mode failure and then 
Solr effectively dies.

Is it possible there's a memory leak? I note that later versions of 
Lucene have fixed a few leaks. Our current versions are relatively old

Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 
18:06:42

Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

so I'm wondering if upgrading to a later version of Lucene might help (of 
course it might not, but I'm trying to investigate all options at this 
point). If so, what's the best way to go about this? Can I just grab the 
Lucene jars and drop them somewhere (or unpack and then repack the solr 
war file)? Or should I use a nightly solr 1.4?

Or am I barking up completely the wrong tree? I'm trawling through heap 
logs and GC logs at the moment trying to see what other tuning I can 
do, but any other hints, tips, tricks or cluebats gratefully received. 
Even if it's just "Yeah, we had that problem and we added more slaves 
and periodically restarted them".

thanks,

Simon


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Em

Hi Simon,

I have no experience with a distributed environment.
However, what you are talking about reminds me of another post on the
mailing list.

Could it be that your slaves haven't finished replicating before the
new replication process starts?
If so, there's your OOM :).

Just a thought, perhaps it helps.

Regards,
Em


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Simon Wistow
On Mon, Jan 24, 2011 at 08:00:53PM +0100, Markus Jelsma said:
 Are you using 3rd-party plugins?

No third-party plugins - this is actually pretty much stock tomcat6 + 
solr from Ubuntu. The only difference is that we've adapted the 
directory layout to fit in with our house style.


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-24 Thread Simon Wistow
On Mon, Jan 24, 2011 at 10:55:59AM -0800, Em said:
 Could it be that your slaves haven't finished replicating before the
 new replication process starts?
 If so, there's your OOM :).

This was one of my thoughts as well - we're currently running a slave 
which has no queries in it just to see if that exhibits similar 
behaviour.

My reasoning against it is that we're not seeing any 

PERFORMANCE WARNING: Overlapping onDeckSearchers=x

in the logs, which is something I'd expect to see.
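(That warning is emitted when more searchers are warming concurrently than 
solrconfig.xml allows, i.e. the limit set by:

<maxWarmingSearchers>2</maxWarmingSearchers>

so its absence suggests commits aren't stacking up during warming.)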

2 minutes doesn't seem like an unreasonable period of time either - the 
docs at http://wiki.apache.org/solr/SolrReplication suggest 20 seconds.