Re: Solr GC issues - Too many BooleanQuery BooleanClause objects in heap
We do have a custom query parser that is responsible for expanding the user input query into a set of prefix, phrase, and regular boolean queries, in a manner similar to what DisMax does. Analyzing the heap with jhat/YourKit is on my list of things to do, but I haven't gotten around to it yet. Our big heap size (13G) makes a full-blown heap dump analysis a little difficult.

Thanks a ton for the reply, Otis!

Prasanna

On Mon, Nov 12, 2012 at 5:42 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Hi,
>
> I've never seen this. You don't have a custom query parser or anything else custom, do you?
>
> Have you tried dumping and analyzing the heap? YourKit has a 7-day eval, or you can use tools like jhat, which may already be included on your machine (see http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html).
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
>
> On Mon, Nov 12, 2012 at 8:35 PM, Prasanna R <plistma...@gmail.com> wrote:
>
>> We have been using Solr in a custom setup where we generate results for user queries by expanding each query into a large boolean query consisting of multiple prefix queries. There have been some GC issues recently, with the old/tenured generation becoming nearly 100% full, leading to near-constant full GC cycles. We are running Solr 3.1 on servers with 13G of heap.
>> The jmap live object histogram is as follows:
>>
>>  num     #instances       #bytes  class name
>> ----------------------------------------------
>>    1:      27441222   1550723760  [Ljava.lang.Object;
>>    2:      23546318    879258496  [C
>>    3:      23813405    762028960  java.lang.String
>>    4:      22700095    726403040  org.apache.lucene.search.BooleanQuery
>>    5:      27431515    658356360  java.util.ArrayList
>>    6:      22911883    549885192  org.apache.lucene.search.BooleanClause
>>    7:      21651039    519624936  org.apache.lucene.index.Term
>>    8:       6876651    495118872  org.apache.lucene.index.FieldsReader$LazyField
>>    9:      11354214    363334848  org.apache.lucene.search.PrefixQuery
>>   10:       4281624    137011968  java.util.HashMap$Entry
>>   11:       3466680     83200320  org.apache.lucene.search.TermQuery
>>   12:       1987450     79498000  org.apache.lucene.search.PhraseQuery
>>   13:        631994     70148624  [Ljava.util.HashMap$Entry;
>>
>> I have looked at the Solr cache settings multiple times but am not able to figure out how/why such a high number of BooleanQuery and BooleanClause instances stay alive. These objects are live and do not get collected even when traffic is disabled and a manual GC is triggered, which indicates that something is holding onto references. Can anyone provide more details on the circumstances under which these objects stay alive and/or get cached? If they are cached, is the caching configurable?
>>
>> Any and all tips/suggestions/pointers will be much appreciated.
>>
>> Thanks,
>> Prasanna
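One place worth checking for the retained Query objects is Solr's own caches: both queryResultCache and filterCache typically key entries by the query objects themselves, so large expanded boolean queries can be retained as cache keys long after the request finishes. The cache bounds are configurable in solrconfig.xml; a minimal sketch (the sizes here are illustrative values, not recommendations):

```xml
<!-- solrconfig.xml fragment: shrinking these caches bounds the number of
     Query objects retained as cache keys; values are illustrative only. -->
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
<filterCache class="solr.LRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
```

If the live objects persist even with small caches and autowarming disabled, the references are likely held elsewhere (for example, a custom component keeping parsed queries around).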
Re: Handling space variations in queries - matching 'thunderbolt' for query 'thunder bolt'
Requesting the community for feedback one more time - does anyone have any suggestions/comments regarding this?

Thanks in advance,
Prasanna

On Sat, Jul 30, 2011 at 12:04 AM, Prasanna R <plistma...@gmail.com> wrote:

> We use a dismax handler with mm set to 1 in our Solr installation. I have a fieldType defined that creates shingles to handle space variations in the input strings and user queries. This fieldType can successfully handle cases where the query is 'thunderbolt' and the document contains the string 'thunder bolt' (the shingle results in the token 'thunderbolt' being created during indexing). However, due to the pre-analysis whitespace tokenization done by the Lucene query parser, the reverse case is not handled well - a document with the string 'thunderbolt' being matched to the query 'thunder bolt'.
>
> I find that in our dismax handler the shingle field records a match and scores on the 'pf' fields, but the document is not returned because none of the fields in 'qf' record a match (mm is 1).
>
> I am looking for suggestions on how to handle this scenario. Using a synonym will obviously work, but it seems a rather hackish solution. Is there a more elegant way of achieving a similar effect? Alternatively, is there a way to get the 'mm' parameter to also factor in matches on 'pf'?
>
> Kindly help.
>
> Regards,
> Prasanna
Handling space variations in queries - matching 'thunderbolt' for query 'thunder bolt'
We use a dismax handler with mm set to 1 in our Solr installation. I have a fieldType defined that creates shingles to handle space variations in the input strings and user queries. This fieldType can successfully handle cases where the query is 'thunderbolt' and the document contains the string 'thunder bolt' (the shingle results in the token 'thunderbolt' being created during indexing). However, due to the pre-analysis whitespace tokenization done by the Lucene query parser, the reverse case is not handled well - a document with the string 'thunderbolt' being matched to the query 'thunder bolt'.

I find that in our dismax handler the shingle field records a match and scores on the 'pf' fields, but the document is not returned because none of the fields in 'qf' record a match (mm is 1).

I am looking for suggestions on how to handle this scenario. Using a synonym will obviously work, but it seems a rather hackish solution. Is there a more elegant way of achieving a similar effect? Alternatively, is there a way to get the 'mm' parameter to also factor in matches on 'pf'?

Kindly help.

Regards,
Prasanna
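For reference, a shingle-producing field of the kind described might be defined along these lines. This is a hedged sketch, not the poster's actual schema: the field type name is hypothetical, and the tokenSeparator="" attribute on ShingleFilterFactory (which makes 'thunder bolt' index the combined token 'thunderbolt') requires a Solr release whose ShingleFilterFactory supports it:

```xml
<!-- Hypothetical schema.xml fragment: index-time shingles with an empty
     separator so "thunder bolt" also produces the single token "thunderbolt". -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the shingling happens only at index time, the query side still sees whatever tokens the query parser produces, which is exactly why the 'thunder bolt' → 'thunderbolt' direction fails before analysis runs.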
Re: Enhancing Solr relevance functions through predefined constants
On Tue, Jun 1, 2010 at 11:57 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : I have a suggestion for improving relevance functions in Solr by way of
> : providing access to a set of pre-defined constants in Solr queries.
> : Specifically, the number of documents indexed, the number of unique terms in
> : a field, the total number of terms in a field, etc. are some of the
> : query-time constants that I believe can be made use of in function queries
> : as well as boosted queries to aid in the relevance calculations.
>
> I'm not sure if he was inspired by your email or not, but I did notice Yonik just opened an issue that sounds very similar to this...
>
> https://issues.apache.org/jira/browse/SOLR-1932

This issue definitely addresses what I had in mind. Glad to see a patch out for it. I feel this has the potential to become pretty big once we have some real use cases for it.

> FWIW: the number of unique terms in a field is really, really expensive to compute (although perhaps we could cache it somewhere)

The number of unique terms (and other similar metrics) is pretty much a query-time constant, and we could have it optionally computed and then cached at the end of every major index build, which would make it readily available for consumption. This would be particularly well suited to setups where indexes are built on node(s) that do not serve traffic and are then replicated to the servers that handle the traffic.

Prasanna
Enhancing Solr relevance functions through predefined constants
Hi all,

I have a suggestion for improving relevance functions in Solr by way of providing access to a set of pre-defined constants in Solr queries. Specifically, the number of documents indexed, the number of unique terms in a field, the total number of terms in a field, etc. are some of the query-time constants that I believe can be made use of in function queries as well as boosted queries to aid in relevance calculations.

One of the tips provided in the Solr 1.4 Enterprise Search Server book relating to function queries is this: "If your data changes in ways causing you to alter the constants in your function queries, then consider implementing a periodic automated test of your Solr data to ensure that the data fits within expected bounds." I believe that having access to some of the constants mentioned above will help in coming up with dynamic boost values that adapt as the underlying data changes. I think this makes sense given that one of the basic relevance scoring metrics, idf, is directly influenced by the number of documents indexed.

I can imagine some of these constants being useful in function queries and boosted queries, but am not able to think of a neat little usage example. I request you all to provide feedback and comments on this idea, to help evaluate whether it is worth creating an enhancement JIRA item for it.

Thanks,
Prasanna
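To make the motivation concrete, here is a hedged sketch of the kind of boost function the proposal targets: a dismax 'bf' in solrconfig.xml where a literal stands in for the corpus size and must be hand-retuned as the index grows. The handler name, field name, and numbers are all hypothetical:

```xml
<!-- Hypothetical solrconfig.xml fragment: the literal 1000000 approximates
     "number of documents indexed" and goes stale as the index grows; a
     pre-defined query-time constant would remove the hard-coding. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="bf">div(popularity,1000000)</str>
  </lst>
</requestHandler>
```

With access to a constant like the indexed document count, the divisor could track the data automatically instead of relying on a periodic manual audit.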
Re: Implementing Autocomplete/Query Suggest using Solr
On Mon, Jan 4, 2010 at 1:20 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R <plistma...@gmail.com> wrote:
>
>> I looked into the Solr/Lucene classes and found the required information. Am summarizing the same for the benefit of those that might refer to this thread in the future.
>>
>> The change I had to make was very simple - make a call to getPrefixQuery instead of getWildcardQuery in my custom-modified Solr dismax query parser class. However, this makes a fairly significant difference in terms of efficiency. The key difference between the Lucene WildcardQuery and PrefixQuery lies in their respective term enumerators, specifically in the term comparators. The termCompare method for PrefixQuery is more lightweight than that of WildcardQuery and is essentially an optimization, given that a prefix query is nothing but a specialized case of a wildcard query. This is also why the Lucene query parser automatically creates a PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery.
>
> I don't understand this. There is nothing that one should need to do in Solr's code to make this work. Prefix queries are supported out of the box in Solr.

I am using the dismax query parser and I match on multiple fields with different boosts. I run a prefix query on some fields in combination with a regular field query on other fields. I do not know of any way to specify a prefix query on a particular field in a dismax query out of the box in Solr 1.4. I had to update Solr to support additional syntax in a dismax query that lets you choose to create a prefix query on a particular field. As part of parsing this custom syntax, I was making a call to getWildcardQuery, which I simply changed to getPrefixQuery.

Prasanna.
Result ordering for Wildcard/Prefix queries or ConstantScoreQueries
All documents matched by wildcard and prefix queries get the same score, as they are scored as a ConstantScoreQuery. Example query: title:abc*

In such cases, what determines the ordering of the results? Is it simply the same order in which those document terms appeared when enumerating through the terms of the matched field in the index?

Also, would it be possible to specify criteria determining the ordering of such matches? I am assuming that should be possible but have little idea how it could be done. Kindly provide guidance/help.

Regards,
Prasanna.
Re: Result ordering for Wildcard/Prefix queries or ConstantScoreQueries
On Wed, Dec 30, 2009 at 5:04 PM, Grant Ingersoll <gsi...@gmail.com> wrote:

> On Dec 30, 2009, at 3:21 PM, Prasanna R wrote:
>
>> All documents matched by wildcard and prefix queries get the same score, as they are scored as a ConstantScoreQuery. Example query: title:abc*
>>
>> In such cases, what determines the ordering of the results? Is it simply the same order in which those document terms appeared when enumerating through the terms of the matched field in the index?
>
> I'm assuming they are just in order of internal Lucene doc id, but I'd have to look to be sure. There were also some changes to Lucene that allow the collectors to take docs out of order, but again, I'd have to check whether that is the case here.
>
>> Also, would it be possible to specify criteria determining the ordering of such matches? I am assuming that should be possible but have little idea how it could be done. Kindly provide guidance/help.
>
> Sort? What problem are you trying to solve?

I am using a prefix query to match a bunch of documents and would like to specify an ordering for the documents matched by that prefix query. This is part of the work I am doing in implementing an autocomplete feature, and I am using the dismax query parser with some custom modifications.

I assume you mean that I can apply a sort ordering to the prefix query matches as part of the results handler. I was not aware of that. Will look into it.

Thanks a lot for the help.

Regards,
Prasanna.
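Since every document matched by a constant-score prefix query ties on score, an explicit sort field is what yields a deterministic ordering. A hedged sketch of baking such a sort into a handler's defaults - the handler name and the popularity field are hypothetical:

```xml
<!-- Hypothetical solrconfig.xml fragment: break constant-score ties with an
     application-defined ranking field instead of internal doc id order. -->
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="sort">popularity desc</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>
```

The same effect is available per-request by passing sort=popularity desc on the query string.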
Re: Implementing Autocomplete/Query Suggest using Solr
>>> We do auto-complete through prefix searches on shingles.
>>
>> Just to confirm, do you mean using the EdgeNGram filter to produce letter ngrams of the tokens in the chosen field?
>
> No, I'm talking about prefix search on tokens produced by a ShingleFilter.

>> I did not know about the prefix query parser in Solr. Thanks a lot for pointing it out. I find relatively little online material about the Solr/Lucene prefix query parser. Kindly point me to any useful resource that I might be missing.

I looked into the Solr/Lucene classes and found the required information. Am summarizing the same for the benefit of those that might refer to this thread in the future.

The change I had to make was very simple - make a call to getPrefixQuery instead of getWildcardQuery in my custom-modified Solr dismax query parser class. However, this makes a fairly significant difference in terms of efficiency. The key difference between the Lucene WildcardQuery and PrefixQuery lies in their respective term enumerators, specifically in the term comparators. The termCompare method for PrefixQuery is more lightweight than that of WildcardQuery and is essentially an optimization, given that a prefix query is nothing but a specialized case of a wildcard query. This is also why the Lucene query parser automatically creates a PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery.

A big thank you to Shalin for providing valuable guidance and insight. And one final request for comment to Shalin on this topic - I am guessing you ensured there were no duplicate terms in the field(s) used for autocompletion. For our first version, I am thinking of eliminating the duplicates outside of the results handler that gives suggestions, since duplicate suggestions originate only from different document IDs in our system and we do want the list of document IDs matched. Is there a better/different way of doing this?

Regards,
Prasanna.
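One commonly suggested way to sidestep the duplicate-suggestion problem is to ask Solr for matching terms rather than matching documents, for example via faceting with facet.prefix, which returns each distinct term once along with its document count. A hedged sketch of such a request - the field name is hypothetical, and this trades away the per-document IDs the poster says they need:

```
# Hypothetical request: distinct shingle terms starting with "thund",
# each returned once regardless of how many documents contain it.
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
    &facet.field=title_shingle&facet.prefix=thund&facet.limit=10
```

When the document IDs are also required, a second query restricted to the chosen suggestion can recover them.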
Re: Implementing Autocomplete/Query Suggest using Solr
On Wed, Dec 23, 2009 at 10:52 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> On Thu, Dec 24, 2009 at 2:39 AM, Prasanna R <plistma...@gmail.com> wrote:
>
>> On Tue, Dec 22, 2009 at 11:49 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>>
>>>> I am curious how an approach that simply uses the wildcard query functionality on an indexed field would work.
>>>
>>> It works fine as long as the terms are not repeated across documents.
>>
>> I do not follow why terms repeating across documents would be an issue. As long as you can differentiate between multiple matches and rank them properly, it should work, right?
>
> A prefix search would return documents. If a field X being used for auto-complete has the same value in two documents, then the user will see the same value being suggested twice.

That is right. I will have to handle removing duplicate values from the results returned by the result handler.

>>> We do auto-complete through prefix searches on shingles.
>>
>> Just to confirm, do you mean using the EdgeNGram filter to produce letter ngrams of the tokens in the chosen field?
>
> No, I'm talking about prefix search on tokens produced by a ShingleFilter.

I did not know about the prefix query parser in Solr. Thanks a lot for pointing it out. I find relatively little online material about the Solr/Lucene prefix query parser. Kindly point me to any useful resource that I might be missing.

Thanks again for all your help.

Regards,
Prasanna.
Re: Implementing Autocomplete/Query Suggest using Solr
On Tue, Dec 22, 2009 at 11:49 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

>> I am curious how an approach that simply uses the wildcard query functionality on an indexed field would work.
>
> It works fine as long as the terms are not repeated across documents.

I do not follow why terms repeating across documents would be an issue. As long as you can differentiate between multiple matches and rank them properly, it should work, right?

>> While Solr does not support wildcard queries out of the box currently, it will definitely be included in the future and I believe the edismax parser already lets you do that.
>
> Solr supports prefix queries, and there's a reverse wildcard filter in trunk too.

Are you referring to facet prefix queries as prefix queries? I looked at the reversed wildcard filter, but think that regular wildcard matching, as opposed to leading wildcard matching, is better suited for an auto-completion feature.

> We do auto-complete through prefix searches on shingles.

Just to confirm, do you mean using the EdgeNGram filter to produce letter ngrams of the tokens in the chosen field? Assuming the regular wildcard query would also work, any thoughts on how it compares to the EdgeNGram approach in terms of added indexing cost, performance, etc.?

Thanks a lot for your valuable inputs/comments.

Prasanna.
Implementing Autocomplete/Query Suggest using Solr
There seem to be a couple of approaches that people have adopted in implementing a query suggestion / auto-completion feature using Solr. Depending on the situation, one might use the terms component, or go the way of using EdgeNGram filters and then querying the index on the ngrammed field. I also found that there is an issue currently active in JIRA (http://issues.apache.org/jira/browse/SOLR-1316) for creating an auto-suggest component.

I am curious how an approach that simply uses the wildcard query functionality on an indexed field would work. While Solr does not support wildcard queries out of the box currently, it will definitely be included in the future, and I believe the edismax parser already lets you do that. Would using a wildcard query to implement autocomplete have high overhead and be less efficient than the other approaches? Am I missing anything here?

Kindly comment and provide some guidance.

Thanks,
Prasanna.
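For context, the EdgeNGram approach mentioned above indexes every prefix of each token explicitly, so matching at query time becomes a plain term lookup rather than a term enumeration. A hedged sketch of such a field definition - the type name and gram sizes are illustrative, not taken from any particular setup:

```xml
<!-- Hypothetical schema.xml fragment: "thunder" indexes as t, th, thu, ...
     so an autocomplete query is an exact term match instead of a
     wildcard/prefix enumeration over the term dictionary. -->
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off versus the wildcard/prefix-query approach is a larger index (one term per prefix) in exchange for cheaper queries, which is the indexing-cost comparison the post asks about.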