Hello!

-----Original message-----
> From: Dyer, James <james.d...@ingrambook.com>
> Sent: Wed 06-Jun-2012 17:23
> To: solr-user@lucene.apache.org
> Subject: RE: issues with spellcheck.maxCollationTries and
> spellcheck.collateExtendedResults
>
> Markus,
>
> With "maxCollationTries=0", it is not going out and querying the collations
> to see how many hits they each produce, so it doesn't know the number of
> hits. That is why, if you also specify "collateExtendedResults=true", all
> the hit counts are zero. It would probably be better in this case if it did
> not report "hits" in the extended response at all. (On the other hand, if
> you're seeing zeros and "maxCollationTries>0", then you've hit a bug!)

I see. It would indeed make sense to drop the hits field when it is always
zero anyway with maxCollationTries=0; even with your explanation it remains a
source of confusion.
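For anyone following along, the request we are discussing boils down to
something like this (host, core and query are hypothetical; the spellcheck
parameters are the ones under discussion):

  http://localhost:8983/solr/select?q=huup+stapel
    &spellcheck=true
    &spellcheck.collate=true
    &spellcheck.collateExtendedResults=true
    &spellcheck.maxCollations=1
    &spellcheck.count=10
    &spellcheck.maxCollationTries=3

With spellcheck.maxCollationTries=0 the candidate collations are never run
against the index, which is exactly why every hits value comes back as zero.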
> "thresholdTokenFrequency" in my opinion is a pretty blunt instrument for
> getting rid of bad suggestions. It takes out all of the rare terms,
> presuming that if a term is rare in the data it either is a mistake or isn't
> worthy of ever being suggested. But if you're using "maxCollationTries", the
> suggestions that don't fit will be filtered out automatically, making
> "thresholdTokenFrequency" less necessary. (On the other hand, if you're
> using IndexBasedSpellChecker, "thresholdTokenFrequency" will make the
> dictionary smaller and "spellcheck.build" run faster... This is solved
> entirely in 4.0 with DirectSolrSpellChecker...)

I forgot to mention this is with the DirectSolrSpellChecker. I guess we'll
just have to try working with thresholdTokenFrequency. It's difficult,
however, because the index will grow, and chances are that at some point a
rare, but correct, token drops below the threshold and is not suggested
anymore. We do also see the benefit of the threshold, since our index is
human-edited and contains rare but misspelled words.
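For completeness, our spellchecker definition looks roughly like the sketch
below (the field name is hypothetical; thresholdTokenFrequency is the
fraction of documents a term must occur in before it may be suggested):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">direct</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="field">content</str>
      <!-- only suggest terms occurring in at least 1% of documents -->
      <float name="thresholdTokenFrequency">.01</float>
    </lst>
  </searchComponent>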
> For the apps here, I've been using "maxCollationTries=10" and have been
> getting good results. Keep in mind that even though you're allowing it to
> try up to 10 queries to find a viable collation, as long as you're setting
> "maxCollations" to something low, it will (hopefully) seldom need to try
> more than a couple before finding one with hits. (I always ask for only 1
> collation, as we just re-apply the spelling correction automatically if the
> original query returned nothing.) Also, if "spellcheck.count" is low it
> might not have enough terms available to try, so you might need to raise
> this value as well when raising "maxCollationTries".

We have a similar set-up and require only one collation to be returned. I can
increase maxCollationTries.

> The worst problem, in my opinion, is the fact that it won't ever suggest
> words if they're in the index (even if using "thresholdTokenFrequency" to
> remove them from the dictionary). For that there is
> https://issues.apache.org/jira/browse/SOLR-2585 which is part of Solr 4.
> The only other workaround is "onlyMorePopular", which has its own issues
> (see
> http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount).

We don't really like onlyMorePopular, since more hits does not always mean a
better suggestion; we turned it off quite some time ago, also because of
SOLR-2555. alternativeTermCount may indeed be a solution (see the P.S. at the
bottom of this mail for a sketch). Thanks, we'll manage for now.

> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, June 06, 2012 5:22 AM
> To: solr-user@lucene.apache.org
> Subject: issues with spellcheck.maxCollationTries and
> spellcheck.collateExtendedResults
>
> Hi,
>
> We've had some issues with a bad zero-hits collation being returned for a
> two-word query where one word was only one edit away from the required
> collation. With spellcheck.maxCollations set to a reasonable number we saw
> the various suggestions without the required collation. We decreased
> thresholdTokenFrequency to make it appear in the list of collations.
> However, with collateExtendedResults=true the hits field for each collation
> was zero, which is incorrect.
>
> Required collation=huub stapel (two hits) and q=huup stapel
>
> "collation":{
>   "collationQuery":"heup stapel",
>   "hits":0,
>   "misspellingsAndCorrections":{
>     "huup":"heup"}},
> "collation":{
>   "collationQuery":"hugo stapel",
>   "hits":0,
>   "misspellingsAndCorrections":{
>     "huup":"hugo"}},
> "collation":{
>   "collationQuery":"hulp stapel",
>   "hits":0,
>   "misspellingsAndCorrections":{
>     "huup":"hulp"}},
> "collation":{
>   "collationQuery":"hup stapel",
>   "hits":0,
>   "misspellingsAndCorrections":{
>     "huup":"hup"}},
> "collation":{
>   "collationQuery":"huub stapel",
>   "hits":0,
>   "misspellingsAndCorrections":{
>     "huup":"huub"}},
> "collation":{
>   "collationQuery":"huur stapel",
>   "hits":0,
>   "misspellingsAndCorrections":{
>     "huup":"huur"}}}}}
>
> Now, with maxCollationTries set to 3 or higher we finally get the required
> collation, which is the only collation able to return results. How can we
> determine the best value for maxCollationTries in relation to lowering
> thresholdTokenFrequency? Why is hits always zero?
>
> This is with today's build and distributed search enabled.
>
> Thanks,
> Markus
>
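P.S. As I read the wiki page above, the SOLR-2585 approach would amount to
adding something like the following to the request once we are on Solr 4
(the values are only examples, untested on our side):

  &spellcheck.alternativeTermCount=5
  &spellcheck.maxResultsForSuggest=5

That should let terms that do exist in the index still receive suggestions,
which is exactly the case where thresholdTokenFrequency stops helping.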