Re: Solr GC issues - Too many BooleanQuery BooleanClause objects in heap

2012-11-13 Thread Prasanna R
We do have a custom query parser that is responsible for expanding the user
input query into a bunch of prefix, phrase and regular boolean queries in a
manner similar to that done by DisMax.
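
To give a rough idea of what that expansion looks like, here is an
illustrative sketch against the Lucene 3.x API we are on - the field name,
clause structure and the helper itself are made up for the example, not our
actual parser code:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;

public class ExpansionSketch {

    // Expand whitespace-separated user tokens into per-token term + prefix
    // clauses plus one phrase clause, all OR'ed together at the top level.
    public static BooleanQuery expand(String field, String[] tokens) {
        BooleanQuery top = new BooleanQuery();
        for (String token : tokens) {
            BooleanQuery perToken = new BooleanQuery();
            perToken.add(new TermQuery(new Term(field, token)), Occur.SHOULD);
            perToken.add(new PrefixQuery(new Term(field, token)), Occur.SHOULD);
            top.add(perToken, Occur.SHOULD);
        }
        PhraseQuery phrase = new PhraseQuery();
        for (String token : tokens) {
            phrase.add(new Term(field, token));
        }
        top.add(phrase, Occur.SHOULD);
        return top;
    }

    public static void main(String[] args) {
        // Even a two-token input allocates several BooleanQuery, BooleanClause
        // and Term instances - the same classes that dominate the histogram below.
        System.out.println(expand("title", new String[] { "thunder", "bo" }));
    }
}

Every request goes through an expansion of this general shape, so plenty of
these objects get allocated per query; whether something then keeps them alive
is exactly what the heap dump should tell us.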

Analyzing heap with jhat/YourKit is on my list of things to do but I
haven't gotten around to doing it yet. Our big heap size (13G) makes it a
little difficult to do a full blown heap dump analysis.

Thanks a ton for the reply Otis!

Prasanna

On Mon, Nov 12, 2012 at 5:42 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I've never seen this.  You don't have a custom query parser or anything
 else custom, do you?
 Have you tried dumping and analyzing heap?  YourKit has a 7 day eval, or
 you can use things like jhat, which may be included on your machine already
 (see http://docs.oracle.com/javase/6/docs/technotes/tools/share/jhat.html).

 Otis
 --
 Performance Monitoring - http://sematext.com/spm/index.html


 On Mon, Nov 12, 2012 at 8:35 PM, Prasanna R plistma...@gmail.com wrote:

   We have been using Solr in a custom setup where we generate results for
  user queries by expanding them into a large boolean query consisting of
  multiple prefix queries. There have been some GC issues recently with the
  Old/tenured generation becoming nearly 100% full leading to near constant
  full GC cycles.
 
  We are running Solr 3.1 on servers with 13G of heap. jmap live object
  histogram is as follows:
 
   num    #instances        #bytes  class name
  ----------------------------------------------------------------------
    1:      27441222    1550723760  [Ljava.lang.Object;
    2:      23546318     879258496  [C
    3:      23813405     762028960  java.lang.String
    4:      22700095     726403040  org.apache.lucene.search.BooleanQuery
    5:      27431515     658356360  java.util.ArrayList
    6:      22911883     549885192  org.apache.lucene.search.BooleanClause
    7:      21651039     519624936  org.apache.lucene.index.Term
    8:       6876651     495118872  org.apache.lucene.index.FieldsReader$LazyField
    9:      11354214     363334848  org.apache.lucene.search.PrefixQuery
   10:       4281624     137011968  java.util.HashMap$Entry
   11:       3466680      83200320  org.apache.lucene.search.TermQuery
   12:       1987450      79498000  org.apache.lucene.search.PhraseQuery
   13:        631994      70148624  [Ljava.util.HashMap$Entry;
  ...
 
   I have looked at the Solr cache settings multiple times but am not able to
   figure out how/why the high number of BooleanQuery and BooleanClause object
   instances stay alive. These objects are live and do not get collected even
   when the traffic is disabled and a manual GC is triggered, which indicates
   that something is holding onto references.

   Can anyone provide more details on the circumstances under which these
   objects stay alive and/or cached? If they are cached, then is the caching
   configurable?
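
   To illustrate the kind of retention I mean - purely a generic sketch, not a
   claim about our particular configuration - anything that caches entries
   keyed by the Query object (Solr's query result and filter caches are keyed
   this way) keeps the whole BooleanQuery/BooleanClause/Term graph reachable
   until the entry is evicted:

   import java.util.LinkedHashMap;
   import java.util.Map;
   import org.apache.lucene.search.Query;

   // Minimal LRU map keyed by Query objects, purely to illustrate retention:
   // every cached key pins its entire clause tree until the entry is evicted.
   public class QueryKeyedCache<V> extends LinkedHashMap<Query, V> {

       private final int maxEntries;

       public QueryKeyedCache(int maxEntries) {
           super(16, 0.75f, true);          // access-order gives LRU eviction
           this.maxEntries = maxEntries;
       }

       @Override
       protected boolean removeEldestEntry(Map.Entry<Query, V> eldest) {
           return size() > maxEntries;      // dropping the entry releases the Query graph
       }
   }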
 
  Any and all tips/suggestions/pointers will be much appreciated.
 
  Thanks,
 
  Prasanna
 



Re: Handling space variations in queries - matching 'thunderbolt' for query 'thunder bolt'

2011-08-05 Thread Prasanna R
Requesting the community for feedback one more time - Does anyone have any
suggestions/comments regarding this?

Thanks in advance,

Prasanna

On Sat, Jul 30, 2011 at 12:04 AM, Prasanna R plistma...@gmail.com wrote:


 We use a dismax handler with mm 1 in our Solr installation. I have a
 fieldType defined that creates shingles to handle space variations in the
 input strings and user queries. This fieldType can successfully handle cases
 where the query is 'thunderbolt' and the document contains the string
 'thunder bolt' (the shingle results in the token 'thunderbolt' created
 during indexing).  However, due to the pre-analysis whitespace tokenization
 done by the Lucene query parser, the reverse is not handled well - a document with
 string 'thunderbolt' being matched to query 'thunder bolt'.

 I find that in our dismax handler the shingle field records a match and
 scores on the 'pf' but the document is not returned as none of the fields in
 'qf' record a match (mm is 1). I am looking for suggestions on how to handle
 this scenario. Using a synonym will obviously work but it seems a rather
 hackish solution. Is there a more elegant way of achieving a similar effect?


 Alternatively, is there a way to get the 'mm' parameter to factor in
 matches on 'pf' also?

 Kindly help.

 Regards,

 Prasanna



Handling space variations in queries - matching 'thunderbolt' for query 'thunder bolt'

2011-07-30 Thread Prasanna R
We use a dismax handler with mm 1 in our Solr installation. I have a
fieldType defined that creates shingles to handle space variations in the
input strings and user queries. This fieldType can successfully handle cases
where the query is 'thunderbolt' and the document contains the string
'thunder bolt' (the shingle results in the token 'thunderbolt' created
during indexing).  However, due to the pre-analysis whitespace tokenization
done by the Lucene query parser, the reverse is not handled well - a document with
string 'thunderbolt' being matched to query 'thunder bolt'.
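
For reference, the index-side behaviour I am describing can be reproduced with
a shingle chain along these lines - a sketch against the Lucene 3.x analysis
API; the real definition lives in our schema.xml, and the field and filter
settings here are only illustrative:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleSketch {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_31,
                new StringReader("thunder bolt"));
        ShingleFilter shingles = new ShingleFilter(ts, 2);  // bigram shingles
        shingles.setTokenSeparator("");   // join adjacent tokens without a space
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println(term.toString());  // thunder, thunderbolt, bolt
        }
        shingles.end();
        shingles.close();
    }
}

This is what lets the single-token query 'thunderbolt' match a document that
contains 'thunder bolt'. The problem above is entirely on the query side: the
whitespace split happens before this analysis ever runs on the query 'thunder
bolt'.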

I find that in our dismax handler the shingle field records a match and
scores on the 'pf' but the document is not returned as none of the fields in
'qf' record a match (mm is 1). I am looking for suggestions on how to handle
this scenario. Using a synonym will obviously work but it seems a rather
hackish solution. Is there a more elegant way of achieving a similar effect?


Alternatively, is there a way to get the 'mm' parameter to factor in matches
on 'pf' also?

Kindly help.

Regards,

Prasanna


Re: Enhancing Solr relevance functions through predefined constants

2010-06-01 Thread Prasanna R
On Tue, Jun 1, 2010 at 11:57 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 :
 : I have a suggestion for improving relevance functions in Solr by way of
 : providing access to a set of pre-defined constants in Solr queries.
 : Specifically, the number of documents indexed, the number of unique terms in
 : a field, the total number of terms in a field, etc. are some of the
 : query-time constants that I believe can be made use of in function queries
 : as well as boosted queries to aid in the relevance calculations.

 I'm not sure if he was inspired by your email or not, but i did notice
 yonik just opened an issue that sounds very similar to this...

 https://issues.apache.org/jira/browse/SOLR-1932


This issue definitely addresses what I had in mind. Glad to see a patch out
for it. I feel this has the potential to become pretty big once we have some
real use cases for it.



 FWIW: number of unique terms in a field is really, really expensive to
 compute (although perhaps we could cache it somewhere)


The number of unique terms (and other similar metrics) is pretty much a
query-time constant, and we can have it optionally computed and then cached
at the end of every major index build, which will make it readily available
for consumption. This will be particularly well suited to setups where indexes
are built on nodes that do not serve traffic and are then replicated to the
servers that handle the traffic.
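
As a sketch of what I mean by computing these once per build (Lucene 3.x API;
the index path and field name below are just placeholders):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Compute index-wide constants once after a build and stash them wherever the
// relevance functions can read them at query time.
public class IndexConstants {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        try {
            System.out.println("numDocs = " + reader.numDocs());
            System.out.println("maxDoc  = " + reader.maxDoc());

            // Counting unique terms in a field means walking the term dictionary,
            // which is why it is expensive per query but cheap once per build.
            long uniqueTitleTerms = 0;
            TermEnum terms = reader.terms(new Term("title", ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"title".equals(t.field())) {
                        break;
                    }
                    uniqueTitleTerms++;
                } while (terms.next());
            } finally {
                terms.close();
            }
            System.out.println("unique terms in 'title' = " + uniqueTitleTerms);
        } finally {
            reader.close();
        }
    }
}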

Prasanna


Enhancing Solr relevance functions through predefined constants

2010-05-25 Thread Prasanna R
Hi all,

I have a suggestion for improving relevance functions in Solr by way of
providing access to a set of pre-defined constants in Solr queries.
Specifically, the number of documents indexed, the number of unique terms in
a field, the total number of terms in a field, etc. are some of the
query-time constants that I believe can be made use of in function queries
as well as boosted queries to aid in the relevance calculations.

One of the tips provided in the Solr 1.4 Enterprise Search Server book
relating to using function queries is this - "If your data changes in ways
causing you to alter the constants in your function queries, then consider
implementing a periodic automated test of your Solr data to ensure that the
data fits within expected bounds."

I believe that having access to some of the constants mentioned above will
help in coming up with dynamic boost values that adapt as the underlying
data changes. I think this makes sense given that one of the basic relevance
scoring metrics - idf - is directly influenced by the number of documents
indexed.

I can imagine some of these constants being useful in Function queries and
Boosted Queries but am not able to think of a neat little usage example.

I request you all to provide feedback and comments on this idea to help
evaluate whether it is worth creating an enhancement JIRA item for it.

Thanks,

Prasanna


Re: Implementing Autocomplete/Query Suggest using Solr

2010-01-04 Thread Prasanna R
On Mon, Jan 4, 2010 at 1:20 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R plistma...@gmail.com wrote:

   I looked into the Solr/Lucene classes and found the required information.
   Am summarizing the same for the benefit of those that might refer to this
   thread in the future.

   The change I had to make was very simple - make a call to getPrefixQuery
   instead of getWildcardQuery in my custom-modified Solr dismax query parser
   class. However, this will make a fairly significant difference in terms of
   efficiency. The key difference between the Lucene WildcardQuery and
   PrefixQuery lies in their respective term enumerators, specifically in the
   term comparators. The termCompare method for PrefixQuery is more
   light-weight than that of WildcardQuery and is essentially an optimization
   given that a prefix query is nothing but a specialized case of a wildcard
   query. Also, this is why the Lucene query parser automatically creates a
   PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery.
 
 
 I don't understand this. There is nothing that one should need to do in
 Solr's code to make this work. Prefix queries are supported out of the box
 in Solr.

 I am using the dismax query parser and I match on multiple fields with
different boosts. I run a prefix query on some fields in combination with a
regular field query on other fields. I do not know of any way in which one
could specify a prefix query on a particular field in a dismax query out of
the box in Solr 1.4. I had to update Solr to support additional syntax in a
dismax query that lets you choose to create a prefix query on a particular
field. As part of parsing this custom syntax, I was making a call to
getWildcardQuery, which I simply changed to getPrefixQuery.

Prasanna.


Result ordering for Wildcard/Prefix queries or ConstantScoreQueries

2009-12-30 Thread Prasanna R
All documents matched for Wildcard and Prefix queries get the same score as
they are scored as a ConstantScoreQuery. Example query - title:abc*

In such cases, what determines the ordering of the results? Is it simply the
order in which the matching documents are encountered when enumerating through
the matched terms of the field in the index?

Also, would it be possible to specify criteria determining the ordering of
such matches? I am assuming that should be possible but have little idea how
that could be done. Kindly provide guidance/help.

Regards,

Prasanna.


Re: Result ordering for Wildcard/Prefix queries or ConstantScoreQueries

2009-12-30 Thread Prasanna R
On Wed, Dec 30, 2009 at 5:04 PM, Grant Ingersoll gsi...@gmail.com wrote:


 On Dec 30, 2009, at 3:21 PM, Prasanna R wrote:

  All documents matched for Wildcard and Prefix queries get the same score as
  they are scored as a ConstantScoreQuery. Example query - title:abc*

  In such cases, what determines the ordering of the results? Is it simply the
  order in which the matching documents are encountered when enumerating
  through the matched terms of the field in the index?

 I'm assuming they are just in order of internal Lucene doc id, but I'd have
 to look for sure.  There were also some changes to Lucene that allowed the
 collectors to take docs out of order, but again, I'd have to check to see if
 that is the case.

 
  Also, would it be possible to specify criteria determining the ordering of
  such matches? I am assuming that should be possible but have little idea how
  that could be done. Kindly provide guidance/help.

 Sort?

 What problem are you trying to solve?

 I am using a prefix query to match a bunch of documents and would like to
specify an ordering for the documents matched by that prefix query. This is
part of the work I am doing in implementing an autocomplete feature, and I am
using the dismax query parser with some custom modifications. I assume you
mean that I can apply a sort ordering to the prefix query matches as part of
the results handler. I was not aware of the same. Will look into that.
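
For my own notes, I take it you mean something along these lines at the Lucene
level (3.x API; 'popularity' is just a hypothetical sort field, and at the
Solr level this would simply be the sort request parameter):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class SortedPrefixMatches {

    // Order constant-score prefix matches by an explicit field instead of
    // relying on whatever order the term/doc enumeration happens to produce.
    public static TopDocs search(IndexSearcher searcher) throws Exception {
        PrefixQuery query = new PrefixQuery(new Term("title", "abc"));
        Sort sort = new Sort(new SortField("popularity", SortField.INT, true));
        return searcher.search(query, null, 10, sort);  // no filter, top 10 hits
    }
}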

Thanks a lot for the help.

Regards,

Prasanna.


Re: Implementing Autocomplete/Query Suggest using Solr

2009-12-29 Thread Prasanna R
 
   We do auto-complete through prefix searches on shingles.
  
 
  Just to confirm, do you mean using EdgeNgram filter to produce letter ngrams
  of the tokens in the chosen field?
 
 

 No, I'm talking about prefix search on tokens produced by a ShingleFilter.


 I did not know about the Prefix query parser in Solr. Thanks a lot for
 pointing out the same.

 I find relatively little online material about the Solr/Lucene prefix query
 parser. Kindly point me to any useful resource that I might be missing.


 I looked into the Solr/Lucene classes and found the required information.
Am summarizing the same for the benefit of those that might refer to this
thread in the future.

 The change I had to make was very simple - make a call to getPrefixQuery
instead of getWildcardQuery in my custom-modified Solr dismax query parser
class. However, this will make a fairly significant difference in terms of
efficiency. The key difference between the Lucene WildcardQuery and
PrefixQuery lies in their respective term enumerators, specifically in the
term comparators. The termCompare method for PrefixQuery is more
light-weight than that of WildcardQuery and is essentially an optimization
given that a prefix query is nothing but a specialized case of a wildcard
query. Also, this is why the Lucene query parser automatically creates a
PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery.
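
To make the difference concrete, both of the following target the same terms,
but - as I read the 3.x code - the prefix form's term enumerator only does a
startsWith check while the wildcard form also runs its pattern matcher over
each candidate term (field name here is illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class PrefixVsWildcard {
    public static void main(String[] args) {
        // Same set of matching terms, different term-enumeration cost.
        Query prefix = new PrefixQuery(new Term("title", "foo"));
        Query wildcard = new WildcardQuery(new Term("title", "foo*"));
        System.out.println(prefix);    // title:foo*
        System.out.println(wildcard);  // title:foo*
    }
}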

A big thank you to Shalin for providing valuable guidance and insight.

And one final request for comment to Shalin on this topic - I am guessing
you ensured there were no duplicate terms in the field(s) used for
autocompletion. For our first version, I am thinking of eliminating the
duplicates outside of the results handler that gives suggestions since
duplicate suggestions originate only from different document IDs in our
system and we do want the list of document IDs matched. Is there a
better/different way of doing the same?

Regards,

Prasanna.


Re: Implementing Autocomplete/Query Suggest using Solr

2009-12-28 Thread Prasanna R
On Wed, Dec 23, 2009 at 10:52 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Thu, Dec 24, 2009 at 2:39 AM, Prasanna R plistma...@gmail.com wrote:

  On Tue, Dec 22, 2009 at 11:49 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  
 I am curious how an approach that simply uses the wildcard query
functionality on an indexed field would work.
  
  
   It works fine as long as the terms are not repeated across documents.
  
  
    I do not follow why terms repeating across documents would be an issue. As
   long as you can differentiate between multiple matches and rank them
   properly it should work right?
 
 
  A prefix search would return documents. If a field X being used for
  auto-complete has the same value in two documents then the user will see the
  same value being suggested twice.


 That is right. I will have to handle removing duplicate values from the
results returned by the result handler.


  We do auto-complete through prefix searches on shingles.
 

  Just to confirm, do you mean using EdgeNgram filter to produce letter ngrams
  of the tokens in the chosen field?



 No, I'm talking about prefix search on tokens produced by a ShingleFilter.


I did not know about the Prefix query parser in Solr. Thanks a lot for
pointing out the same.

I find relatively little online material about the Solr/Lucene prefix query
parser. Kindly point me to any useful resource that I might be missing.

Thanks again for all your help.

Regards,

Prasanna.


Re: Implementing Autocomplete/Query Suggest using Solr

2009-12-23 Thread Prasanna R
On Tue, Dec 22, 2009 at 11:49 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


   I am curious how an approach that simply uses the wildcard query
  functionality on an indexed field would work.


 It works fine as long as the terms are not repeated across documents.


 I do not follow why terms repeating across documents would be an issue. As
long as you can differentiate between multiple matches and rank them
properly it should work right?



   While Solr does not support
   wildcard queries out of the box currently, it will definitely be included in
   the future and I believe the edismax parser already lets you do that.


  Solr supports prefix queries and there's a reverse wild card filter in trunk
  too.


Are you referring to facet prefix queries as prefix queries? I looked at the
reversed wildcard filter but think that regular wildcard matching, as opposed
to leading wildcard matching, is better suited for an auto-completion feature.


 We do auto-complete through prefix searches on shingles.


Just to confirm, do you mean using EdgeNgram filter to produce letter ngrams
of the tokens in the chosen field?

Assuming the regular wildcard query would also work, any thoughts on how it
compares to the EdgeNGram approach in terms of added indexing cost,
performance, etc.?
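
To put some shape on the indexing-cost side of that comparison, this is the
token blow-up the EdgeNGram route implies - plain Java that just counts
prefixes, not a measured benchmark:

public class EdgeNGramCost {
    public static void main(String[] args) {
        String term = "thunderbolt";
        // Front edge n-grams with minGram=1 add one token per prefix length,
        // so autocomplete becomes a plain term lookup instead of the term
        // enumeration a prefix or wildcard query has to do.
        for (int len = 1; len <= term.length(); len++) {
            System.out.println(term.substring(0, len));  // t, th, thu, ... thunderbolt
        }
    }
}

So the trade appears to be a larger term dictionary and more postings at index
time in exchange for cheaper lookups at query time, while the wildcard/prefix
route keeps the index smaller but pays a term enumeration on every request.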

Thanks a lot for your valuable inputs/comments.

Prasanna.


Implementing Autocomplete/Query Suggest using Solr

2009-12-22 Thread Prasanna R
 There seem to be a couple of approaches that people have adopted in
implementing a query suggestion / auto completion feature using Solr.
Depending on the situation, one might use the terms component or go the way
of using EdgeNGramFilters and then querying the index on the ngrammed field.
I also found that there is an issue currently active in JIRA
(http://issues.apache.org/jira/browse/SOLR-1316) for creating an auto suggest
component.

 I am curious how an approach that simply uses the wildcard query
functionality on an indexed field would work. While Solr does not support
wildcard queries out of the box currently, it will definitely be included in
the future and I believe the edismax parser already lets you do that. Would
using the wildcard query to implement autocomplete have high overhead and be
less efficient than the other approaches? Am I missing anything here? Kindly
comment and provide some guidance.

Thanks,

Prasanna.