RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-24 Thread SandeepM

One of our main concerns is that Solr returns the best match based on what it
thinks is best.  It uses Levenshtein distance to determine the
best suggestions.  Can we tune this to put more weight on
frequency/hits versus the number of edits?  If we can tune this, suggestions
would seem more relevant when corrected.  Also, it would be great if we could
do this while keeping maxCollations = 1 and maxCollationTries at some
reasonable number so that QTime does not go out of control.

Any insights into this would be great. Thanks for your help.

Regards,
-- Sandeep



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DirectSolrSpellChecker-vastly-varying-spellcheck-QTime-times-tp4057176p4058655.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-24 Thread Dyer, James
When getting collations there are two steps. 

First, the spellchecker gets individual word choices for each misspelled word.  
By default, these are sorted by string distance first, then document frequency 
second.  You can override this by specifying 
<str name="comparatorClass">freq</str> in your spellchecker component 
configuration in solrconfig.xml.  The example provided in the distribution has a 
commented-out section explaining this.
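
To make the two orderings concrete, here is a minimal Python sketch. It is not Solr's code: it assumes plain Levenshtein distance (Solr's "internal" distance measure differs in detail), and the function names are made up for illustration. The word/frequency pairs are the suggestions for "hapyness" from the response posted elsewhere in this thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# (word, document frequency) pairs reported for the misspelling "hapyness".
SUGGESTIONS = [("happyness", 175), ("hapiness", 62), ("hayness", 1),
               ("happiness", 7788), ("harkness", 324)]


def by_distance_then_freq(term, suggestions):
    """Default-style ordering: closest words first, frequency breaks ties."""
    return sorted(suggestions, key=lambda s: (levenshtein(term, s[0]), -s[1]))


def by_freq_then_distance(term, suggestions):
    """'freq'-style ordering: most frequent words first, distance breaks ties."""
    return sorted(suggestions, key=lambda s: (-s[1], levenshtein(term, s[0])))


if __name__ == "__main__":
    print([w for w, _ in by_distance_then_freq("hapyness", SUGGESTIONS)])
    print([w for w, _ in by_freq_then_distance("hapyness", SUGGESTIONS)])
```

Under these assumptions the distance-first ordering leads with "happyness", while the frequency-first ordering leads with the much more common "happiness".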

In the second step, one correction is taken off each list and checked against 
the index to see if it is a valid collation.  By valid, it needs to return at 
least 1 hit.  The order in which word combinations are tried is dictated by 
the first step.  Once it runs out of tries, runs out of suggestions, or has 
enough valid collations, it stops.  You cannot configure this to try a bunch 
and sort by # hits or anything like that.  You would have to specify a large # 
of collations to be returned and do this in your application.  But this runs 
the risk of high QTimes.
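
The second step can be sketched roughly as the loop below. This is an illustration only, not Solr's implementation: the enumeration order of itertools.product is a simplification of how Solr ranks combinations, and collate, count_hits and TOY_INDEX are hypothetical names standing in for the real index lookup.

```python
from itertools import islice, product


def collate(candidates_per_term, count_hits, max_tries, max_collations):
    """Try word combinations in list order and keep those that return hits.

    Stops when max_tries combinations have been tested, when the
    combinations run out, or when max_collations valid ones are found.
    """
    valid = []
    tries = 0
    for combo in islice(product(*candidates_per_term), max_tries):
        tries += 1
        query = " ".join(combo)
        hits = count_hits(query)      # one extra query against the index
        if hits >= 1:                 # "valid" means at least 1 hit
            valid.append((query, hits))
            if len(valid) >= max_collations:
                break
    return valid, tries


# Hypothetical stand-in for the index: only this phrase returns hits.
TOY_INDEX = {"chocolate factory": 85}


def toy_count_hits(query):
    return TOY_INDEX.get(query, 0)


if __name__ == "__main__":
    candidates = [["chocolate"], ["factory", "factor", "factus"]]
    print(collate(candidates, toy_count_hits, max_tries=10, max_collations=1))
```

Each pass through the loop costs one query, which is why the number of tries, not the number of collations returned, drives QTime.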

So you can sort by frequency, but not by hits.  Sorting by hits would mean 
trying a lot of collations and that is probably too expensive.

One caveat is that sorting by frequency could result in far afield results 
being returned to the user.  You might find that lower-frequency, 
smaller-edit-distance suggestions are going to give the user what they want 
more than higher-edit-distance, higher-frequency suggestions.  Just because a 
word is very common doesn't mean it is the right word.  This is why distance 
is the default and not freq.  

James Dyer
Ingram Content Group
(615) 213-4311






RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-23 Thread SandeepM
James, Is there a way to determine how many times the collations were tried?  
Is there a parameter that can be issued that can return this in debug
information?  This would be very helpful.
Appreciate your help with this.

Thanks.
-- Sandeep





RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-23 Thread Dyer, James
If you enable debug-level logging for class 
org.apache.solr.spelling.SpellCheckCollator, you should get a log message for 
every collation it tries like this:

Collation:   will return zzz hits.
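
Once those messages are in the log, counting the tries is a matter of counting matching lines. The Python sketch below assumes each relevant line contains "Collation:" followed later by "will return N hits."; the exact log layout depends on your logging configuration, and the function name and sample lines are made up for illustration.

```python
import re

# Assumed shape of the debug message: "Collation: ... will return N hits."
COLLATION_LINE = re.compile(r"Collation:.*will return (\d+) hits\.")


def summarize_collation_tries(log_lines):
    """Count collation attempts and how many of them returned hits."""
    hits = []
    for line in log_lines:
        match = COLLATION_LINE.search(line)
        if match:
            hits.append(int(match.group(1)))
    return {"tries": len(hits), "with_hits": sum(1 for h in hits if h > 0)}


if __name__ == "__main__":
    # Hypothetical log excerpt, not real Solr output.
    sample = [
        "DEBUG SpellCheckCollator Collation: chocolate factor will return 0 hits.",
        "DEBUG SpellCheckCollator Collation: chocolate factory will return 85 hits.",
        "INFO  SolrCore some unrelated line",
    ]
    print(summarize_collation_tries(sample))
```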

James Dyer
Ingram Content Group
(615) 213-4311






RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-22 Thread SandeepM
James, Thanks.  That was very helpful. That helped me understand count and
alternativeTermCount a bit more.

I also have the following case as pointed out earlier...
My query: 

http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" to "chocolate
factory", which exists in my spell field index. I see a QTime from the above
query of somewhere between 350-400ms.

I run a similar query, replacing the spellcheck terms with "pursut hapyness",
where "pursuit happyness" actually exists in my spell field, and I see a
QTime of 15-17ms.

Both queries produce collations correctly, and picking the first suggestions
and applying them as the collation finds what I am looking for, but there is
an order-of-magnitude difference in QTime.  There is one edit per term in both
cases, or 2 edits in each query. The lengths of the words in both queries seem
identical. I'd like to understand why there is this vast difference in
QTime.  Also, "Chocolate factory" and "Pursuit happyness" are both spellcheck
indexed as is.

I would appreciate any help with this since I am not sure how I can get any
meaningful performance numbers and attribute the slowness to anything in
particular. 

Thanks.
Regards,
-- Sandeep





RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-22 Thread Dyer, James
On both queries, set spellcheck.extendedResults=true and also 
spellcheck.collateExtendedResults=true, then post the full spelling response. 
 Also, how long does each query take on average with spellcheck turned off?
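
For reference, a request with both flags switched on could be assembled like this in Python. It is a sketch only: the host and the "spell" default field are taken from the query posted in this thread, and spellcheck_url is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url, phrase, extra=None):
    """Build a spellcheck-only request URL with extended results enabled."""
    params = {
        "q": "",                                    # no main query, as in the thread
        "spellcheck.q": phrase,
        "spellcheck": "true",
        "df": "spell",
        "wt": "xml",
        "spellcheck.extendedResults": "true",
        "spellcheck.collateExtendedResults": "true",
    }
    if extra:
        params.update(extra)                        # caller-supplied overrides
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(spellcheck_url("http://host/solr/select", "chocolat factry"))
    print(spellcheck_url("http://host/solr/select", "pursut hapyness",
                         extra={"spellcheck.maxCollationTries": "0"}))
```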

James Dyer
Ingram Content Group
(615) 213-4311






RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-22 Thread SandeepM
Chocolat Factry


<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">77</int>
</lst>
<result name="response" numFound="0" start="0">
</result>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="chocolat">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">8</int>
      <int name="origFreq">615</int>
      <arr name="suggestion">
        <lst>
          <str name="word">chocolate</str>
          <int name="freq">6544</int>
        </lst>
      </arr>
    </lst>
    <lst name="factry">
      <int name="numFound">5</int>
      <int name="startOffset">9</int>
      <int name="endOffset">15</int>
      <int name="origFreq">6</int>
      <arr name="suggestion">
        <lst>
          <str name="word">factory</str>
          <int name="freq">23614</int>
        </lst>
        <lst>
          <str name="word">factor</str>
          <int name="freq">5128</int>
        </lst>
        <lst>
          <str name="word">factus</str>
          <int name="freq">290</int>
        </lst>
        <lst>
          <str name="word">factum</str>
          <int name="freq">178</int>
        </lst>
        <lst>
          <str name="word">factae</str>
          <int name="freq">102</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">chocolate factory</str>
      <int name="hits">85</int>
      <lst name="misspellingsAndCorrections">
        <str name="chocolat">chocolate</str>
        <str name="factry">factory</str>
      </lst>
    </lst>
  </lst>
</lst>
</response>




Pursut Hapyness
<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">16</int>
</lst>
<result name="response" numFound="0" start="0">
</result>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="pursut">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">6</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">pursuit</str>
          <int name="freq">1209</int>
        </lst>
        <lst>
          <str name="word">pursue</str>
          <int name="freq">108</int>
        </lst>
        <lst>
          <str name="word">pursit</str>
          <int name="freq">1</int>
        </lst>
        <lst>
          <str name="word">perdut</str>
          <int name="freq">94</int>
        </lst>
        <lst>
          <str name="word">purdue</str>
          <int name="freq">70</int>
        </lst>
      </arr>
    </lst>
    <lst name="hapyness">
      <int name="numFound">5</int>
      <int name="startOffset">7</int>
      <int name="endOffset">15</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">happyness</str>
          <int name="freq">175</int>
        </lst>
        <lst>
          <str name="word">hapiness</str>
          <int name="freq">62</int>
        </lst>
        <lst>
          <str name="word">hayness</str>
          <int name="freq">1</int>
        </lst>
        <lst>
          <str name="word">happiness</str>
          <int name="freq">7788</int>
        </lst>
        <lst>
          <str name="word">harkness</str>
          <int name="freq">324</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">pursuit happyness</str>
      <int name="hits">10</int>
      <lst name="misspellingsAndCorrections">
        <str name="pursut">pursuit</str>
        <str name="hapyness">happyness</str>
      </lst>
    </lst>
  </lst>
</lst>
</response>

Spellcheck is used separately and we are not using any q along with
spellcheck.

Our search query also queries other fields, not just spellcheck and
therefore does not give a good representation of Qtime.   We use groupings
in the search query.
For Chocolate Factory, I get a search QTime of 198ms
For Pursuit Happyness, I get a search QTime of 318ms

Would appreciate your insights.
Thanks.
-- Sandeep






RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-22 Thread Dyer, James
This doesn't make a lot of sense to me, as in both cases the very first 
collation it tries is the one it is returning.  So you're getting a very 
optimized spellcheck in both cases.  But it does have to issue both queries 2 
times:  the first time, it tries the user's main query and, finding there are 
not enough hits, it then tries the collation query to see how many hits that 
will return.  Could it be that these two queries just are less/more expensive 
and that difference gets magnified by running each twice?

James Dyer
Ingram Content Group
(615) 213-4311






RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-19 Thread Dyer, James
I guess the first thing I'd do is to set maxCollationTries to zero.  This 
means it will only run your main query once and not re-run it to check the 
collations. Now see if your queries have consistent qtime.  One easy 
explanation is that with maxCollationTries=10, it may be running your query 
up to 11 times to check up to 10 possible collations.  If the query takes 50ms 
by itself, then you've got 550ms total to not find spelling corrections.  
Unfortunately, the worst case here is the one that gives the user nothing back. 
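
The arithmetic above, written out as a tiny Python sketch. It assumes every collation check costs about as much as the main query, which is a simplification, and worst_case_ms is a made-up name.

```python
def worst_case_ms(query_ms, max_collation_tries):
    """Worst case: the main query runs once, plus one re-run per collation try."""
    return query_ms * (1 + max_collation_tries)


if __name__ == "__main__":
    print(worst_case_ms(50, 10))  # main query + 10 failed collation checks
    print(worst_case_ms(50, 0))   # collations are never checked against the index
```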
 

Another thing to look at: with maxCollationTries at zero, set maxCollations 
to 10.  This will give you a list of the 10 collations it would have tried.  
You can figure out if the one that gets hits is far enough down the list to 
explain the high total qtime when maxCollationTries=10.  If this explains it, 
then the obvious solution is to set maxCollationTries to something lower than 
10 (you'll need to weigh how long you're willing to make your users wait to 
possibly get spelling suggestions).  Or possibly, use spellcheck.q to give it 
an easier query to evaluate than the main query (but that can still give valid 
collations). Also, see https://issues.apache.org/jira/browse/SOLR-3240 which is 
an optimization for this feature.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: SandeepM [mailto:skmi...@hotmail.com] 
Sent: Thursday, April 18, 2013 11:33 PM
To: solr-user@lucene.apache.org
Subject: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

Hi!

I am using SOLR 4.2.1.

My solrconfig.xml contains the following:

  <searchComponent name="MySpellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">text_spell</str>

    <lst name="spellchecker">
      <str name="name">MySpellchecker</str>
      <str name="field">spell</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="distanceMeasure">internal</str>
      <float name="accuracy">0.5</float>
      <int name="maxEdits">2</int>
      <int name="minPrefix">1</int>
      <int name="maxInspections">5</int>
      <int name="minQueryLength">3</int>
      <float name="maxQueryFrequency">0.01</float>
    </lst>
  </searchComponent>

  <requestHandler name="/select" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <int name="rows">10</int>
      <str name="df">id</str>
      <str name="spellcheck.dictionary">MySpellchecker</str>
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">false</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.alternativeTermCount">10</str>
      <str name="spellcheck.maxResultsForSuggest">35</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.collateExtendedResults">false</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">1</str>
      <str name="spellcheck.collateParam.q.op">AND</str>
    </lst>
    <arr name="last-components">
      <str>MySpellcheck</str>
    </arr>
  </requestHandler>

schema.xml with the spell field looks like:

<fieldType name="text_spell" class="solr.TextField"
           positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true" />
  </analyzer>
</fieldType>

<field name="spell" type="text_spell" indexed="true"
       stored="false" multiValued="true" />

<copyField source="title" dest="spell" />
<copyField source="artist" dest="spell" />
 
My query:
http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" to "chocolate
factory", which exists in my spell field index. I see a QTime from the above
query of somewhere between 350-400ms.

I run a similar query, replacing the spellcheck terms with "pursut hapyness",
where "pursuit happyness" actually exists in my spell field, and I see a
QTime of 15-17ms.

Both queries produce collations correctly, but there is an order-of-magnitude
difference in QTime.  There is one edit per term in both cases, or 2 edits in
each query. The lengths of the words in both queries seem identical. I'd
like to understand why there is this vast difference in QTime.  I would
appreciate 

RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-19 Thread SandeepM
James,
Thanks for the reply.  I see your point and, sure enough, reducing
maxCollationTries does reduce the time, but it may not produce results.
It seems the time is taken by the collation re-runs.  Is there any
way we can activate caching for collations?  The same query repeatedly takes
the same amount of time.  My query caches are activated, but I don't
believe they get used for spellchecks.
Thanks.
-- Sandeep





RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

2013-04-19 Thread Dyer, James
I do not know what it would take to have the collation tests make better use of 
the QueryResultCache.  However, outside of a test scenario, I do not know if 
this would help a lot.

Hopefully you wouldn't have a lot of users issuing the exact same query with 
the exact same misspelled words over and over.  In the real world, if you find 
that a collation is a better query than the one the user initially issued, then 
when that user pages through results, etc, your application should use the 
corrected query and not re-run the incorrect query over and over again.  In the 
case of maxResultsForSuggest, if a user does the first query then rejects any 
did-you-mean suggestions, you can just turn spellcheck off if they page, 
facet, etc, so that you don't have to generate these suggestions over and over 
again.  

When setting maxCollationTries, you do have to weigh whether it is 
acceptable to make a user with a misspelled query wait 1/2 second or so to 
(hopefully) get a correction, or whether you want to simply reduce the maximum 
time someone will have to wait.  If you find that it usually needs 10 tries to 
find a good collation, then you probably need to try a different distance 
algorithm, or play with the various accuracy settings to see if you can get 
better corrections to be nearer the top of the individual-word lists.  Also, 
try setting alternativeTermCount lower than count (maybe set atc to 1/2 of 
what you have for count).  This will reduce the number of terms it has to try 
combinations of.  If you set maxResultsForSuggest to a lower value (like 2-3, 
maybe), then it won't try to return did-you-mean suggestions for queries 
returning (was it 35?!) hits.
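
To see why shorter per-word lists help, note that the number of possible word combinations is the product of the list lengths. A small Python sketch (max_combinations is a hypothetical name, and maxCollationTries still caps how many combinations are actually tested):

```python
from math import prod


def max_combinations(suggestions_per_term):
    """Upper bound on distinct collations: product of the per-term list sizes."""
    return prod(suggestions_per_term)


if __name__ == "__main__":
    print(max_combinations([10, 10]))  # two terms, 10 suggestions each
    print(max_combinations([5, 5]))    # same terms with the lists halved
```

Halving each list for a two-word query cuts the candidate combinations to a quarter, so a good collation is more likely to be reached within the allowed tries.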

As I mentioned, SOLR-3240 does have promise of speeding this feature up so 
maybe we won't have to talk about these kinds of trade-offs so much in the 
future.

James Dyer
Ingram Content Group
(615) 213-4311

