Re: Spellcheck Phrases
Please start a new thread for this question; see: http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, so your question is hidden in that thread and gets less attention. It also makes following discussions in the mailing list archives particularly difficult.

Best
Erick

On Tue, Aug 23, 2011 at 11:47 AM, Herman Kiefus herm...@angieslist.com wrote:

The angle that I am trying here is to create a dictionary, built from indexed terms, that contains only correctly spelled words. We are doing this by having the field from which the dictionary is created use a type that employs solr.KeepWordFilterFactory, which in turn uses a text file of known correctly spelled words (including their respective derivations, for example: lead, leads, leading, etc.). This is working great for us, except for those fields in our schema that contain proper names. I can't seem to get (unfiltered) terms from those fields, along with (correctly spelled) terms from other fields, into the single field upon which the dictionary is built.
RE: Spellcheck Phrases
The angle that I am trying here is to create a dictionary, built from indexed terms, that contains only correctly spelled words. We are doing this by having the field from which the dictionary is created use a type that employs solr.KeepWordFilterFactory, which in turn uses a text file of known correctly spelled words (including their respective derivations, for example: lead, leads, leading, etc.). This is working great for us, except for those fields in our schema that contain proper names. I can't seem to get (unfiltered) terms from those fields, along with (correctly spelled) terms from other fields, into the single field upon which the dictionary is built.

-----Original Message-----
From: Dyer, James [mailto:james.d...@ingrambook.com]
Sent: Thursday, June 02, 2011 11:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Spellcheck Phrases

Actually, someone just pointed out to me that a patch like this is unnecessary. The code works as-is if configured like this:

    <float name="thresholdTokenFrequency">.01</float>   (correct)

instead of this:

    <str name="thresholdTokenFrequency">.01</str>   (incorrect)

I tested this and it seems to work. I'm still trying to figure out if using this parameter actually improves the quality of our spell suggestions, now that I know how to use it properly. Sorry about the misinformation earlier.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
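For readers trying to picture the setup Herman describes, a minimal schema.xml sketch might look like the following. Only solr.KeepWordFilterFactory comes from his message; the type name text_spell_keep, the field name spell_source, the word-list file name correct_words.txt, and the tokenizer choice are assumptions made purely for illustration.

```xml
<!-- Hypothetical field type: keeps only tokens listed in correct_words.txt -->
<fieldType name="text_spell_keep" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- correct_words.txt (assumed name): one known-good word per line, e.g. lead, leads, leading -->
    <filter class="solr.KeepWordFilterFactory" words="correct_words.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field from which the spelling dictionary would be built -->
<field name="spell_source" type="text_spell_keep" indexed="true" stored="false" multiValued="true"/>
```

One likely reason the proper-name fields are hard to mix in: copyField hands the destination field the original, unanalyzed text, so the destination's own analyzer (including the keep-word filter) is applied to everything copied into it.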
RE: Spellcheck Phrases
Actually, someone just pointed out to me that a patch like this is unnecessary. The code works as-is if configured like this:

    <float name="thresholdTokenFrequency">.01</float>   (correct)

instead of this:

    <str name="thresholdTokenFrequency">.01</str>   (incorrect)

I tested this and it seems to work. I'm still trying to figure out if using this parameter actually improves the quality of our spell suggestions, now that I know how to use it properly. Sorry about the misinformation earlier.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Dyer, James
Sent: Wednesday, June 01, 2011 3:02 PM
To: solr-user@lucene.apache.org
Subject: RE: Spellcheck Phrases

Tanner,

I just entered SOLR-2571 to fix the float-parsing bug that breaks thresholdTokenFrequency. It's just a 1-line code fix, so I also included a patch that should cleanly apply to Solr 3.1. See https://issues.apache.org/jira/browse/SOLR-2571 for info and patches.

This parameter appears absent from the wiki, and as it has always been broken for me, I haven't tested it. However, my understanding is that it should be set as the minimum percentage of documents in which a term has to occur in order for it to appear in the spelling dictionary. For instance, in the config below, a term would have to occur in at least 1% of the documents for it to be part of the spelling dictionary. This might be a good setting for long fields, but for the short fields in my application, I was thinking of setting this to something like 1/1000 of 1% ...

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text</str>
      <lst name="spellchecker">
        <str name="name">spellchecker</str>
        <str name="field">Spelling_Dictionary</str>
        <str name="fieldType">text</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="thresholdTokenFrequency">.01</str>
      </lst>
    </searchComponent>

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
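Putting James's correction together with the configuration from his June 1 message, the spellchecker definition would presumably look like the sketch below. Only the thresholdTokenFrequency element changes (float instead of str); everything else is taken from the thread, with the XML punctuation restored.

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">Spelling_Dictionary</str>
    <str name="fieldType">text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <!-- Declared as a float element; the str form is the one reported in this thread to fail -->
    <float name="thresholdTokenFrequency">.01</float>
  </lst>
</searchComponent>
```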
RE: Spellcheck Phrases
Tanner,

I just entered SOLR-2571 to fix the float-parsing bug that breaks thresholdTokenFrequency. It's just a 1-line code fix, so I also included a patch that should cleanly apply to Solr 3.1. See https://issues.apache.org/jira/browse/SOLR-2571 for info and patches.

This parameter appears absent from the wiki, and as it has always been broken for me, I haven't tested it. However, my understanding is that it should be set as the minimum percentage of documents in which a term has to occur in order for it to appear in the spelling dictionary. For instance, in the config below, a term would have to occur in at least 1% of the documents for it to be part of the spelling dictionary. This might be a good setting for long fields, but for the short fields in my application, I was thinking of setting this to something like 1/1000 of 1% ...

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text</str>
      <lst name="spellchecker">
        <str name="name">spellchecker</str>
        <str name="field">Spelling_Dictionary</str>
        <str name="fieldType">text</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="thresholdTokenFrequency">.01</str>
      </lst>
    </searchComponent>

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Tanner Postert [mailto:tanner.post...@gmail.com]
Sent: Friday, May 27, 2011 6:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellcheck Phrases

Are there any updates on this? Any third-party apps that can make this work as expected?
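To make the percentages concrete: going by James's reading of the parameter (which he notes is untested), the value is a fraction of documents, so 1% is 0.01 and 1/1000 of 1% is 0.01 / 1000 = 0.00001. The fragment below only illustrates those two settings; the float element follows the correction reported in the June 2 follow-up in this thread.

```xml
<!-- Alternative values for the same parameter; a real config would contain only one -->

<!-- Term must occur in at least 1% of documents -->
<float name="thresholdTokenFrequency">0.01</float>

<!-- Term must occur in at least 1/1000 of 1% of documents (0.01 / 1000) -->
<float name="thresholdTokenFrequency">0.00001</float>
```

With an index of 1,000,000 documents, for example, those thresholds would work out to 10,000 documents and 10 documents respectively.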
Re: Spellcheck Phrases
Are there any updates on this? Any third-party apps that can make this work as expected?

On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James james.d...@ingrambook.com wrote:

Tanner,

Currently Solr will only make suggestions for words that are not in the dictionary, unless you specify spellcheck.onlyMorePopular=true. However, if you do that, then it will try to improve every word in your query, even the ones that are spelled correctly (so while it might change "brake" to "break", it might also change "leg" to "log").

You might be able to alleviate some of the pain by setting thresholdTokenFrequency so as to remove misspelled and rarely used words from your dictionary, although I personally haven't been able to get this parameter to work. It also doesn't seem to be documented on the wiki, but it is in the 1.4.1 source code, in class IndexBasedSpellChecker. It's also mentioned in Smiley & Pugh's book. I tried setting it like this, but got a ClassCastException on the float value:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text_spelling</str>
      <lst name="spellchecker">
        <str name="name">spellchecker</str>
        <str name="field">Spelling_Dictionary</str>
        <str name="fieldType">text_spelling</str>
        <str name="buildOnOptimize">true</str>
        <str name="thresholdTokenFrequency">.001</str>
      </lst>
    </searchComponent>

I have it on my to-do list to look into this further but haven't yet. If you decide to try it and can get it to work, please let me know how you do it.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Tanner Postert [mailto:tanner.post...@gmail.com]
Sent: Wednesday, February 23, 2011 12:53 PM
To: solr-user@lucene.apache.org
Subject: Spellcheck Phrases

Right now, when I search for 'brake a leg', Solr returns valid results with no indication of misspelling, which is understandable since all of those terms are valid words and are probably found in a few pieces of our content. My question is: is there any way for it to recognize that the phrase should be 'break a leg' and not 'brake a leg', and suggest the proper phrase?
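As a sketch of the query-side option James mentions, spellcheck.onlyMorePopular can be set as a request-handler default instead of on every request. The handler name /spell is an assumption, and spellcheck.collate is an addition not discussed in the thread: it asks Solr to return a rewritten copy of the whole query built from the top suggestion for each term, which is probably the closest built-in thing to a phrase-level suggestion. It does not make the spellchecker phrase-aware.

```xml
<!-- Hypothetical handler name; the component name matches the searchComponent in this thread -->
<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- Also suggest for correctly spelled words when a more frequent term exists -->
    <str name="spellcheck.onlyMorePopular">true</str>
    <!-- Return the query rewritten with the best suggestion for each term -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

As James warns, onlyMorePopular applies to every term, so a collated result could just as easily rewrite "leg" as fix "brake".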
Spellcheck Phrases
Right now, when I search for 'brake a leg', Solr returns valid results with no indication of misspelling, which is understandable since all of those terms are valid words and are probably found in a few pieces of our content. My question is: is there any way for it to recognize that the phrase should be 'break a leg' and not 'brake a leg', and suggest the proper phrase?
Autsuggest/autocomplete/spellcheck phrases
How can I preserve phrases for autosuggest/autocomplete/spellcheck? For example, we have a bunch of product listings, and if someone types "louis" I want it to come up with "Louis Vuitton"; for "World" ... "World cup". Would I need n-grams? Shingling?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Autsuggest-autocomplete-spellcheck-phrases-tp902951p902951.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Autsuggest/autocomplete/spellcheck phrases
Blargy,

I've been experimenting with this myself for a work project. What I did was use a combination of the two, running the indexed terms through the shingle filter and then through the edge n-gram filter. I did this in order to be able to match terms like:

    .net asp c#
    asp .net c#
    c# asp .net
    c# asp.net

for a word query like "asp c# .net". The edge n-grams are good, but they can also fail to match on queries when the words in the index are in a different order than those in the query.

My setup in schema.xml looks like this:

    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
      </analyzer>
    </fieldType>

Let me know how this works for you.

On Thu, Jun 17, 2010 at 11:05 AM, Blargy zman...@hotmail.com wrote:

How can I preserve phrases for autosuggest/autocomplete/spellcheck? For example, we have a bunch of product listings, and if someone types "louis" I want it to come up with "Louis Vuitton"; for "World" ... "World cup". Would I need n-grams? Shingling?

Thanks
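One variation sometimes used with this kind of field type, offered here as an assumption rather than as Michael's setup: apply the edge n-gram filter only at index time, so that the text the user has typed is matched against the indexed prefixes as it stands instead of being broken into its own n-grams. The type name edgytext_split and the field name suggest_text are hypothetical.

```xml
<fieldType name="edgytext_split" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: shingles plus edge n-grams, as in the setup above -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <!-- Query time: same chain without the edge n-gram step -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field holding the text to suggest from -->
<field name="suggest_text" type="edgytext_split" indexed="true" stored="true"/>
```

For a value such as "Louis Vuitton", the index-time chain would emit the leading prefixes (l, lo, lou, and so on, up to 25 characters) of louis, of vuitton, and of the shingle louis vuitton.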
Re: Autsuggest/autocomplete/spellcheck phrases
Thanks for the reply, Michael. I'll definitely try that out and let you know how it goes. Your solution sounds similar to the one I've read here: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ There are some good comments in there too.

I think I am having the biggest trouble distinguishing what needs to be done for autocomplete/autosuggestion (Google-like behavior) from a separate issue involving spellchecking ("Did you mean..."). I guess I originally thought those two distinct features would involve the same solution, but it appears that they are completely different. Your solution sounds like it works best for autocomplete, and I will be using it for that exact purpose ;) One question though... how do you favor more popular words/documents over others?

Now my next question is: how would I get the spellchecker to work with phrases? So if I typed "vitton", it would come back with something like: Did you mean: 'Louis Vuitton'? Will this also require a combination of n-grams and shingles?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Autsuggest-autocomplete-spellcheck-phrases-tp902951p903225.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Autsuggest/autocomplete/spellcheck phrases
We base the auto-suggest on popular searches. Our site logs the search terms in a database, and a simple query gives us a summary counting the number of times each search was entered and the number of results it returned, similar to the criteria used in the Lucid Imagination article you cite. Each record includes the search terms, the total number of times they were entered, and the maximum number of hits returned. Each record is fed in as a document. At regular intervals, older documents are deleted and newer ones are added.

On Thu, Jun 17, 2010 at 12:29 PM, Blargy zman...@hotmail.com wrote:

Thanks for the reply, Michael. I'll definitely try that out and let you know how it goes. Your solution sounds similar to the one I've read here: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ There are some good comments in there too.

I think I am having the biggest trouble distinguishing what needs to be done for autocomplete/autosuggestion (Google-like behavior) from a separate issue involving spellchecking ("Did you mean..."). I guess I originally thought those two distinct features would involve the same solution, but it appears that they are completely different. Your solution sounds like it works best for autocomplete, and I will be using it for that exact purpose ;) One question though... how do you favor more popular words/documents over others?

Now my next question is: how would I get the spellchecker to work with phrases? So if I typed "vitton", it would come back with something like: Did you mean: 'Louis Vuitton'? Will this also require a combination of n-grams and shingles?

Thanks
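The popular-search feed described in this reply could be expressed as ordinary Solr XML update messages. In the sketch below, the field names (id, query, times_entered, max_hits, last_seen), the sample values, and the 30-day cutoff are all made up for illustration; the thread does not give the actual schema. The two messages would be posted to the update handler separately.

```xml
<!-- Message 1: add or replace one document per logged search phrase (sample values are invented) -->
<add>
  <doc>
    <field name="id">louis vuitton</field>
    <field name="query">louis vuitton</field>
    <field name="times_entered">412</field>
    <field name="max_hits">87</field>
    <field name="last_seen">2010-06-17T00:00:00Z</field>
  </doc>
</add>

<!-- Message 2: periodically drop phrases that have not been searched for recently -->
<delete>
  <query>last_seen:[* TO NOW-30DAYS]</query>
</delete>
```

Sorting suggestions on a count field like times_entered (for example, sort=times_entered desc) would then be one way to favor the more popular searches.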
Re: Autsuggest/autocomplete/spellcheck phrases
OK, that makes perfect sense.

"What I did was use a combination of the two running the indexed terms through" - I initially read this as meaning that you took the terms from your current index and used them to build up your dictionary.

--
View this message in context: http://lucene.472066.n3.nabble.com/Autsuggest-autocomplete-spellcheck-phrases-tp902951p903299.html
Sent from the Solr - User mailing list archive at Nabble.com.