RE: stemming filter analyzers, any favorites?
Hi Robert,

we often ran into the same issue with stemmers. That is why we created more than one field, each with a different stemmer. It adds some overhead, but it worked quite well.

Regarding your off-topic question: look at the debugging output of your searches. Sometimes the tools, especially the WDF (WordDelimiterFilterFactory), are configured wrongly, and the query parser then creates an unexpected query that fails to match documents that are still relevant.

Please show us your debugging output and the field definition so that we can provide some help!

Regards,
Em
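A minimal sketch of the multi-field setup Em describes, assuming a schema.xml similar to the analyzer stack posted in this thread. The type and field names (text_kstem, text_porter, name, name_porter) are hypothetical, and the filter chains are shortened for readability. Each indexed field gets its own terms in the index, so the second field is declared stored="false" to keep the added overhead to the indexed terms.

```xml
<!-- Hypothetical field types: same tokenizer, different stemmers. -->
<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

<fieldType name="text_porter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields: one source field, one copy analyzed with the other stemmer. -->
<field name="name" type="text_kstem" indexed="true" stored="true"/>
<field name="name_porter" type="text_porter" indexed="true" stored="false"/>

<!-- copyField passes the raw input to the second field, which then applies its own analyzer. -->
<copyField source="name" dest="name_porter"/>
```

A search would then have to query both fields, for example by listing them in the qf parameter of the dismax query parser.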
RE: stemming filter analyzers, any favorites?
Adding another field with another stemmer and searching both? Wow, I never thought of doing that. I guess that doesn't really double the size of your index, though, because almost all the terms are the same, right? Let me look into that. I'll raise the other issue in a separate thread. Thanks.

-----Original Message-----
From: Em [mailto:mailformailingli...@yahoo.de]
Sent: Thursday, April 21, 2011 1:55 AM
To: solr-user@lucene.apache.org
Subject: RE: stemming filter analyzers, any favorites?

Hi Robert,

we often ran into the same issue with stemmers. That is why we created more than one field, each with a different stemmer. It adds some overhead, but it worked quite well.

Regarding your off-topic question: look at the debugging output of your searches. Sometimes the tools, especially the WDF (WordDelimiterFilterFactory), are configured wrongly, and the query parser then creates an unexpected query that fails to match documents that are still relevant.

Please show us your debugging output and the field definition so that we can provide some help!

Regards,
Em
RE: stemming filter analyzers, any favorites?
As far as I know, Lucene does not store an inverted index per field, so no, it would not double the size of the index.

However, it could influence the score a little. For example, if both stemmers reduce schools to school and you search for all schools in america, the term school carries more weight in the resulting score, since it definitely occurs in two fields that hold nearly the same value.

To reduce this effect, you could write your own query parser that creates a DisjunctionMaxQuery consisting of two boolean queries with a tie-breaker of 0, so that only the better-scoring stemmed field contributes to the total score of your document.

Regards,
Em
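Instead of a custom query parser, Solr's built-in dismax query parser may already give this behaviour: it builds a DisjunctionMaxQuery for each query term across the fields listed in qf, and with tie set to 0.0 only the best-scoring field counts for that term. The fragment below is a sketch for solrconfig.xml under that assumption, not the custom parser described above. The handler name /stemsearch and the field names name and name_porter are hypothetical.

```xml
<!-- Hypothetical handler: searches both stemmed fields, best field wins per term. -->
<requestHandler name="/stemsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Fields analyzed with the two different stemmers (assumed names). -->
    <str name="qf">name name_porter</str>
    <!-- tie=0.0: the lower-scoring field adds nothing to the score. -->
    <float name="tie">0.0</float>
    <!-- Returns the parsed query and score explanations with each response. -->
    <str name="debugQuery">true</str>
  </lst>
</requestHandler>
```

The debugQuery default is only there for testing; its output shows the parsed query, which is also the debugging output asked for earlier in the thread.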
RE: stemming filter analyzers, any favorites?
Nice! Thanks!

-----Original Message-----
From: Em [mailto:mailformailingli...@yahoo.de]
Sent: Thursday, April 21, 2011 9:23 AM
To: solr-user@lucene.apache.org
Subject: RE: stemming filter analyzers, any favorites?

As far as I know, Lucene does not store an inverted index per field, so no, it would not double the size of the index.

However, it could influence the score a little. For example, if both stemmers reduce schools to school and you search for all schools in america, the term school carries more weight in the resulting score, since it definitely occurs in two fields that hold nearly the same value.

To reduce this effect, you could write your own query parser that creates a DisjunctionMaxQuery consisting of two boolean queries with a tie-breaker of 0, so that only the better-scoring stemmed field contributes to the total score of your document.

Regards,
Em
Re: stemming filter analyzers, any favorites?
You can get a better sense of exactly which transformations occur, and when, if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag. What does the analysis page say?

Best
Erick

On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen rober...@buy.com wrote:

Stemming filter analyzers... anyone have any favorites for particular search domains? Just wondering what people are using. I'm using the Lucid K Stemmer and having issues. It seems to miss a lot of common stems. We went to it because of excessively loose matches with solr.PorterStemFilterFactory. I understand K Stemmer is a dictionary-based stemmer. It seems to me that it is missing a lot of common stem reductions; e.g., Bags does not match Bag in our searches.

Here is my analyzer stack:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
        <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="query_synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- The LucidKStemmer currently requires a lowercase filter somewhere before it. -->
        <filter class="com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
RE: stemming filter analyzers, any favorites?
I have been doing that, and for the Bags example the trailing 's' is not being removed by the KStemmer, so if you index the word bags and search on bag you get no matches. Why wouldn't the trailing 's' get stemmed off? KStemmer is dictionary based, so is bags not in the dictionary? That trailing 's' should always be dropped, no? That seems like it would be better; we don't want to make synonyms for basic use cases like this. I fear I will have to return to the Porter stemmer. Are there other, better ones? That is my main question.

Off-topic secondary question: sometimes I am puzzled by the output of the analysis page. It seems like there should be a match, but I don't get the results I'd expect during a search. For example, when the WordDelimiterFilterFactory splits a term into several terms before the K-stemmer is applied, the matching term is sometimes in position 2 of the final analysis, while the searcher entered only the partial term, which is therefore in position 1 of the analysis stack; in that case the search found no match. Am I reading this correctly? Is that right, or should that match and I am misreading my analysis output?

Thanks!

Robi

PS I have a category named Bags and am catching flak for it not coming up in a search for bag. Hah.

PPS The term is not in protwords.txt:

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
term position     1
term text         bags
term type         word
source start,end  0,4
payload

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, April 20, 2011 10:55 AM
To: solr-user@lucene.apache.org
Subject: Re: stemming filter analyzers, any favorites?

You can get a better sense of exactly which transformations occur, and when, if you look at the analysis page (be sure to check the verbose checkbox). I'm surprised that bags doesn't match bag. What does the analysis page say?

Best
Erick
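On the main question of other stemmers: Solr ships several English stemming filters besides Porter, in the versions that include them. The sketch below shows one less aggressive option, solr.EnglishMinimalStemFilterFactory, which only strips plurals and so should reduce bags to bag. Whether it suits this catalogue is an assumption to verify on the analysis page. The type name text_minstem is hypothetical and the filter chain is shortened.

```xml
<!-- Hypothetical field type using a plural-only English stemmer. -->
<fieldType name="text_minstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strips plural endings only; much less aggressive than Porter. -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this type declares a single analyzer, the same chain runs at index and query time, which keeps bags and bag on the same stemmed form on both sides.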