RE: issues with spellcheck.maxCollationTries and spellcheck.collateExtendedResults

Dyer, James Wed, 06 Jun 2012 08:23:22 -0700

Markus,

With "maxCollationTries=0", it is not going out and querying the collations to 
see how many hits they each produce.  So it doesn't know the # of hits.  That 
is why if you also specify "collateExtendedResults=true", all the hit counts 
are zero.  It would probably be better in this case if it would not report 
"hits" in the extended response at all.  (On the other hand, if you're seeing 
zeros and "maxCollationTries>0", then you've hit a bug!)
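
As a rough illustration of the point above, here is a minimal Python sketch of how a client might treat the "hits" field. The helper name `trusted_hits` and the flat list-of-dicts input are assumptions made for illustration and are not part of Solr; the real response nests collations differently depending on the response writer.

```python
def trusted_hits(collations, max_collation_tries):
    """Return (collationQuery, hits) pairs, with hits=None when unverified.

    Hypothetical helper: `collations` is assumed to be a list of dicts shaped
    like the "collation" entries of an extended spellcheck response.
    """
    result = []
    for c in collations:
        # With maxCollationTries=0 the candidate queries were never run,
        # so a reported hit count of 0 carries no information.
        hits = c.get("hits") if max_collation_tries > 0 else None
        result.append((c["collationQuery"], hits))
    return result
```

With "maxCollationTries=0" every entry comes back with `None` instead of a misleading zero; with a positive value the reported counts are passed through unchanged.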


"thresholdTokenFrequency" in my opinion is a pretty blunt instrument for 
getting rid of bad suggestions.  It takes out all of the rare terms, presuming 
that if a term is rare in the data it either is a mistake or isn't worthy to be 
suggested ever.  But if you're using "maxCollationTries", the suggestions that 
don't fit will be filtered out automatically, so "thresholdTokenFrequency" 
is needed less.  (On the other hand, if you're using IndexBasedSpellChecker, 
"thresholdTokenFrequency" will make the dictionary smaller and 
"spellcheck.build" run faster...  This is solved entirely in 4.0 with 
DirectSolrSpellChecker...) 

For the apps here, I've been using "maxCollationTries=10" and have been getting 
good results.  Keep in mind that even though you're allowing it to try up to 10 
queries to find a viable collation, so long as you're setting "maxCollations" 
to something low it will (hopefully) seldom need to try more than a couple 
before finding one with hits.  (I always ask for only 1 collation as we just 
re-apply the spelling correction automatically if the original query returned 
nothing).  Also, if "spellcheck.count" is low it might not have enough terms 
available to try, so you might need to raise this value also if raising 
"maxCollationTries".
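
The settings described above could be expressed as request parameters roughly as follows. This is only a sketch: the parameter names are the standard spellcheck ones mentioned in this thread, but the `build_spellcheck_params` helper and the "spellcheck.count" default of 20 are assumptions for illustration, not values taken from this thread.

```python
from urllib.parse import urlencode


def build_spellcheck_params(query, count=20):
    """Build a query string asking for a single verified collation.

    Hypothetical helper; `count` is an illustrative default, chosen only so
    that enough candidate terms are available for the collation tries.
    """
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.collate": "true",
        "spellcheck.maxCollations": 1,        # only one collation is wanted
        "spellcheck.maxCollationTries": 10,   # test up to 10 candidate queries
        "spellcheck.collateExtendedResults": "true",
        "spellcheck.count": count,            # suggestions per term to draw from
    }
    return urlencode(params)
```

The resulting string would be appended to whatever select handler URL the application already uses.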

The worse problem, in my opinion, is that it won't ever suggest corrections 
for query words that are in the index (even if using "thresholdTokenFrequency" 
to remove them from the dictionary).  For that there is 
https://issues.apache.org/jira/browse/SOLR-2585 which is part of Solr 4.  The 
only other workaround is "onlyMorePopular" which has its own issues.  (see 
http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount).

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, June 06, 2012 5:22 AM
To: solr-user@lucene.apache.org
Subject: issues with spellcheck.maxCollationTries and 
spellcheck.collateExtendedResults

Hi,

We've had some issues with a bad zero-hits collation being returned for a two 
word query where one word was only one edit away from the required collation. 
With spellcheck.maxCollations set to a reasonable number we saw the various 
suggestions without the required collation. We decreased 
thresholdTokenFrequency to make it appear in the list of collations. However, 
with collateExtendedResults=true the hits field for each collation was zero, 
which is incorrect.

Required collation=huub stapel (two hits) and q=huup stapel

      "collation":{
        "collationQuery":"heup stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"heup"}},
      "collation":{
        "collationQuery":"hugo stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"hugo"}},
      "collation":{
        "collationQuery":"hulp stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"hulp"}},
      "collation":{
        "collationQuery":"hup stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"hup"}},
      "collation":{
        "collationQuery":"huub stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"huub"}},
      "collation":{
        "collationQuery":"huur stapel",
        "hits":0,
        "misspellingsAndCorrections":{
          "huup":"huur"}}}}}

Now, with maxCollationTries set to 3 or higher we finally get the required 
collation, which is the only collation able to return results. How can we 
determine the best value for maxCollationTries given the decrease in 
thresholdTokenFrequency? Why is hits always zero?

This is with today's build and distributed search enabled.

Thanks,
Markus
