Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Hi,

Just curious, was there any resolution to this?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 8. feb. 2011, at 03.40, Markus Jelsma wrote:

> Do you have GC logging enabled? Tail -f the log file and you'll see what CMS is telling you. Tuning the occupancy fraction of the tenured generation to a value lower than the default, and telling the JVM to use only your value to initiate a collection, can help a lot. The same goes for sizing the young generation and sometimes the survivor ratio. Consult the HotSpot CMS settings and the young (or new) generation sizes; they are very important. If you have multiple slaves under the same load you can easily try different configurations.
>
> Keeping an eye on the nodes with a tool like JConsole, and at the same time tailing the GC log, will help a lot. Don't forget to send updates and frequent commits, or you won't be able to replay. I've never seen a Solr instance go down under heavy load and without commits, but they tend to behave badly when commits occur while under heavy load with long cache-warming times (and the heap consumption that goes with them).
>
> You might also be suffering from memory fragmentation; this is bad and can lead to failure. You can configure the JVM to force a compaction before a GC, which is nice, but it does consume CPU time. A query of death can, in theory, also happen when you sort on a very large dataset that isn't optimized; in this case the maxDoc value is too high.
>
> Anyway, try some settings, monitor the nodes, and please report your findings.
>
>> On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
>>> Heap usage can spike after a commit. Existing caches are still in use and new caches are being generated and/or auto-warmed. Can you confirm this is the case?
>>
>> We see spikes after replication, which I suspect is, as you say, because of the ensuing commit.
>>
>> What we seem to have found is that when we weren't using the concurrent GC, stop-the-world GC runs would kill the app. Now that we're using CMS, we occasionally find ourselves in situations where the app still has memory left over, but the load on the machine spikes, the GC duty cycle goes to 100%, and the app never recovers. Restarting usually helps, but sometimes we have to take the machine out of the load balancer, wait for a number of minutes, and then put it back in.
>>
>> We're working on two hypotheses.
>>
>> Firstly - we're CPU bound somehow, and at some point we cross some threshold and GC or something else is just unable to keep up. So whilst it looks like instantaneous death of the app, it's actually gradual resource exhaustion, where the definition of 'gradual' is 'a very short period of time' (as opposed to some cataclysmic infinite-loop bug somewhere).
>>
>> Either that or ...
>>
>> Secondly - there's some sort of Query of Death that kills machines. We just haven't found it yet, even when replaying logs.
>>
>> Or some combination of both. Or other things. It's maddeningly frustrating.
>>
>> We've also got to try deploying a custom solr.war and try using MMapDirectory to see if that helps with anything.
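The occupancy-fraction and young-generation tuning Markus describes maps onto standard HotSpot flags. A sketch of what such a configuration might look like (the heap and generation sizes here are illustrative placeholders, and `start.jar` stands in for however you actually launch Solr):

```shell
# Sketch of the CMS tuning described above; all sizes are examples to adjust.
java -server -Xms8g -Xmx8g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:NewSize=1g -XX:MaxNewSize=1g \
  -XX:SurvivorRatio=6 \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
  -jar start.jar
```

`CMSInitiatingOccupancyFraction` lowers the tenured-generation occupancy at which a concurrent collection starts, and `UseCMSInitiatingOccupancyOnly` tells the JVM to use only that value rather than its own heuristics; the last line enables the GC log you would then tail.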
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said:
> Just curious, was there any resolution to this?

Not really. We tuned the GC pretty aggressively - we use these options:

    -server -Xmx20G -Xms20G -Xss10M
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
    -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
    -XX:SoftRefLRUPolicyMSPerMB=10

and we've played a little with CompressedOops and AggressiveOpts. We also backported the MMapDirectory factory to 1.4.1, and that helped a lot.

We do still get spikes of long (5s-20s) queries a few times an hour which don't appear to be caused by any kind of Query of Death. Occasionally (once every few days) one of the slaves will experience a period of sustained slowness, but it recovers by itself in less than a minute. According to our GC logs we haven't had a full GC for a long time.

Currently the state of play is that we commit on our master every 5000ms and the slaves replicate every 2 minutes. Our response times for searches on the slaves are about 180-270ms, but if we turn off replication then we get 60-90ms, so something is clearly up with that.

Having talked to the good people at Lucid, we're going to try playing around with commit intervals, upping our mergeFactor from 10 to 25, and maybe using the BalancedSegmentMergePolicy. The system seems to be stable at the moment, which is good, but obviously we'd like to lower our query times if possible.

Hopefully this might be of some use to somebody out there, sometime.

Simon
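For reference, the mergeFactor change and the BalancedSegmentMergePolicy experiment mentioned above would be expressed in a 1.4-era solrconfig.xml along these lines (a sketch: the exact element form varied between Solr releases, and the policy class assumes Lucene's misc contrib jar is on the classpath):

```xml
<!-- Sketch only: mergeFactor raised from the default of 10, per the advice
     above; the mergePolicy line assumes the Lucene misc contrib jar is
     available and may need the syntax of your specific Solr version. -->
<indexDefaults>
  <mergeFactor>25</mergeFactor>
  <mergePolicy>org.apache.lucene.index.BalancedSegmentMergePolicy</mergePolicy>
</indexDefaults>
```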
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Heap usage can spike after a commit. Existing caches are still in use and new caches are being generated and/or auto-warmed. Can you confirm this is the case?

On Friday 28 January 2011 00:34:42 Simon Wistow wrote:
> On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
>> Are you sure you need CMS incremental mode? It's only advised when running on a machine with one or two processors. If you have more you should consider disabling the incremental flags.
>
> I'll test again, but we added those to get better performance - not much, but there did seem to be an improvement.
>
> The problem seems not to be in average use, but that occasionally there's a huge spike in load (there doesn't seem to be a particular killer query) and Solr just never recovers.
>
> Thanks,
>
> Simon

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
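The post-commit spike Markus describes comes from the old and new searcher caches coexisting on the heap while the new ones auto-warm. The knobs for that window live in solrconfig.xml; a sketch with illustrative (not recommended) values:

```xml
<!-- Sketch: smaller autowarmCount values shorten warm-up and the window in
     which the old and new caches coexist on the heap after a commit.
     The sizes and counts here are illustrative, not recommendations. -->
<filterCache class="solr.FastLRUCache"
             size="16384" initialSize="4096" autowarmCount="1024"/>
<queryResultCache class="solr.LRUCache"
                  size="16384" initialSize="4096" autowarmCount="256"/>
```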
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> Heap usage can spike after a commit. Existing caches are still in use and new caches are being generated and/or auto-warmed. Can you confirm this is the case?

We see spikes after replication, which I suspect is, as you say, because of the ensuing commit.

What we seem to have found is that when we weren't using the concurrent GC, stop-the-world GC runs would kill the app. Now that we're using CMS, we occasionally find ourselves in situations where the app still has memory left over, but the load on the machine spikes, the GC duty cycle goes to 100%, and the app never recovers. Restarting usually helps, but sometimes we have to take the machine out of the load balancer, wait for a number of minutes, and then put it back in.

We're working on two hypotheses.

Firstly - we're CPU bound somehow, and at some point we cross some threshold and GC or something else is just unable to keep up. So whilst it looks like instantaneous death of the app, it's actually gradual resource exhaustion, where the definition of 'gradual' is 'a very short period of time' (as opposed to some cataclysmic infinite-loop bug somewhere).

Either that or ...

Secondly - there's some sort of Query of Death that kills machines. We just haven't found it yet, even when replaying logs.

Or some combination of both. Or other things. It's maddeningly frustrating.

We've also got to try deploying a custom solr.war and try using MMapDirectory to see if that helps with anything.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Tue, Jan 25, 2011 at 01:28:16PM +0100, Markus Jelsma said:
> Are you sure you need CMS incremental mode? It's only advised when running on a machine with one or two processors. If you have more you should consider disabling the incremental flags.

I'll test again, but we added those to get better performance - not much, but there did seem to be an improvement.

The problem seems not to be in average use, but that occasionally there's a huge spike in load (there doesn't seem to be a particular killer query) and Solr just never recovers.

Thanks,

Simon
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on a machine with one or two processors. If you have more you should consider disabling the incremental flags.

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
> We have two slaves replicating off one master every 2 minutes, both using the CMS + ParNew garbage collector. Specifically:
>
>     -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
>
> but periodically they both get into a GC storm and just keel over. Looking through the GC logs, the amount of memory reclaimed in each GC run gets less and less until we get a concurrent mode failure, and then Solr effectively dies.
>
> Is it possible there's a memory leak? I note that later versions of Lucene have fixed a few leaks. Our current versions are relatively old:
>
>     Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>     Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>
> so I'm wondering if upgrading to a later version of Lucene might help (of course it might not, but I'm trying to investigate all options at this point). If so, what's the best way to go about this? Can I just grab the Lucene jars and drop them somewhere (or unpack and then repack the Solr war file)? Or should I use a nightly Solr 1.4? Or am I barking up completely the wrong tree?
>
> I'm trawling through heap logs and GC logs at the moment trying to see what other tuning I can do, but any other hints, tips, tricks or cluebats gratefully received. Even if it's just "Yeah, we had that problem and we added more slaves and periodically restarted them".
>
> Thanks,
>
> Simon

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
We have two slaves replicating off one master every 2 minutes, both using the CMS + ParNew garbage collector. Specifically:

    -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

but periodically they both get into a GC storm and just keel over. Looking through the GC logs, the amount of memory reclaimed in each GC run gets less and less until we get a concurrent mode failure, and then Solr effectively dies.

Is it possible there's a memory leak? I note that later versions of Lucene have fixed a few leaks. Our current versions are relatively old:

    Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
    Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

so I'm wondering if upgrading to a later version of Lucene might help (of course it might not, but I'm trying to investigate all options at this point). If so, what's the best way to go about this? Can I just grab the Lucene jars and drop them somewhere (or unpack and then repack the Solr war file)? Or should I use a nightly Solr 1.4? Or am I barking up completely the wrong tree?

I'm trawling through heap logs and GC logs at the moment trying to see what other tuning I can do, but any other hints, tips, tricks or cluebats gratefully received. Even if it's just "Yeah, we had that problem and we added more slaves and periodically restarted them".

thanks,

Simon
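One way to see the shrinking-reclamation pattern described above at a glance is to pull the occupancy transitions out of the GC log. A minimal sketch, assuming the usual HotSpot `before->after(total)` format that `-XX:+PrintGCDetails` produces:

```python
import re

# Matches HotSpot GC log occupancy transitions like "8192K->1024K(9216K)".
TRANSITION = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")

def reclaimed_per_gc(lines):
    """Return KB reclaimed by each GC event found in the given log lines.

    Only the first transition on each line is used (the young-generation
    one on ParNew lines); a steadily shrinking series of values is the
    pattern that precedes a concurrent mode failure.
    """
    reclaimed = []
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            before, after, _total = (int(g) for g in m.groups())
            reclaimed.append(before - after)
    return reclaimed
```

Feeding it a tailed `gc.log` and plotting (or just eyeballing) the resulting list shows whether each collection is reclaiming progressively less memory.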
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Hi Simon,

I have no experience with a distributed environment. However, what you are talking about reminds me of another post on the mailing list. Could it be that your slaves haven't finished replicating before the next replication process starts? If so, there you'd get the OOM :).

Just a thought; perhaps it helps.

Regards,
Em
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Mon, Jan 24, 2011 at 08:00:53PM +0100, Markus Jelsma said:
> Are you using 3rd-party plugins?

No third-party plugins - this is actually pretty much stock Tomcat 6 + Solr from Ubuntu. The only difference is that we've adapted the directory layout to fit in with our house style.
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
On Mon, Jan 24, 2011 at 10:55:59AM -0800, Em said:
> Could it be that your slaves haven't finished replicating before the next replication process starts? If so, there you'd get the OOM :).

This was one of my thoughts as well - we're currently running a slave which has no queries going to it, just to see if that exhibits similar behaviour.

My reasoning against it is that we're not seeing any

    PERFORMANCE WARNING: Overlapping onDeckSearchers=x

in the logs, which is something I'd expect to see. 2 minutes doesn't seem like an unreasonable period of time either - the docs at http://wiki.apache.org/solr/SolrReplication suggest 20 seconds.
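For reference, the slave-side poll interval under discussion is set in the replication handler config. A sketch matching the 2-minute interval described in this thread (the masterUrl is a placeholder):

```xml
<!-- Slave-side replication config: 00:02:00 matches the 2-minute poll
     discussed above; the SolrReplication wiki example uses 00:00:20.
     The masterUrl below is a placeholder for your own master. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <str name="pollInterval">00:02:00</str>
  </lst>
</requestHandler>
```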