Re: camel-casing and dismax troubles
On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young ge...@modperlcookbook.org wrote:
> hi all :)
>
> I'm having trouble with camel-cased query strings and the dismax handler.
>
> a user query
>
>   LeAnn Rimes
>
> isn't matching the indexed term
>
>   Leann Rimes

This is the camel-case case that can't currently be handled by a single WordDelimiterFilter.

If the indexed doc had LeAnn, then it would be indexed as le,ann/leann and hence queries of both forms "le ann" and "leann" would match.

However, since the indexed term is simply leann, a WordDelimiterFilter configured to split won't match (a search for LeAnn will be translated into a search for "le ann").

One way to work around this now is to do a copyField into another field that catenates split terms in the query analyzer instead of generating/splitting, and then search across both fields.

BTW, your parsed query below shows you turned on both catenation and generation (or perhaps preserveOriginal) for split subwords in your query analyzer. Unfortunately this configuration doesn't work due to the ambiguity of what it means to have multiple terms at the same position (this is the same problem as for multi-word synonyms at query time). The query shown below looks for "leann" or "le" followed by "ann", and hence an indexed term of "leann" won't match.

-Yonik
http://www.lucidimagination.com

> even though both are lower-cased in the end. furthermore, the analysis tool shows a match.
>
> the debug query looks like
>
>   "parsedquery":"+((DisjunctionMaxQuery((search-en:\"(leann le) ann\")) DisjunctionMaxQuery((search-en:rimes)))~2) ()",
>
> I have a feeling it's due to how the broken up tokens are added back into the token stream with PreserveOriginal, and some strange interaction between that order and dismax, but I'm not entirely sure.
>
> configs follow. thoughts appreciated.
>
> --Geoff
>
> <fieldType name="search-en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             preserveOriginal="1"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="1"
>             catenateNumbers="1"
>             catenateAll="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords-en.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             preserveOriginal="1"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords-en.txt"/>
>   </analyzer>
> </fieldType>
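
For readers following along, the copyField workaround Yonik describes might look roughly like the schema.xml fragment below. It is only a sketch under stated assumptions: the name search-en-cat (used for both the field and its type) is hypothetical, the indexed/stored attributes on the field are guesses, and it assumes search-en is itself a field name (as the parsed query suggests). The one deliberate difference from the posted config is the query analyzer, which catenates subwords and emits a single token per position instead of splitting.

```xml
<!-- Sketch only: "search-en-cat" is a hypothetical name, not from the thread. -->
<fieldType name="search-en-cat" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: same analysis chain as the posted search-en type. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords-en.txt"/>
  </analyzer>
  <!-- Query side: catenate only, so LeAnn becomes the single term leann. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords-en.txt"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field plus copyField; copyField copies the raw input text,
     which is then analyzed by the destination field's own index analyzer. -->
<field name="search-en-cat" type="search-en-cat" indexed="true" stored="false"/>
<copyField source="search-en" dest="search-en-cat"/>
```

With something like this in place, a query for LeAnn would still be split into "le ann" against search-en, while the same query against search-en-cat would become leann. It is worth confirming the token output for both fields in the analysis tool before relying on it.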
Re: camel-casing and dismax troubles
On Wed, May 13, 2009 at 6:23 AM, Yonik Seeley yo...@lucidimagination.com wrote:
> On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young ge...@modperlcookbook.org wrote:
>> hi all :)
>>
>> I'm having trouble with camel-cased query strings and the dismax handler.
>>
>> a user query
>>
>>   LeAnn Rimes
>>
>> isn't matching the indexed term
>>
>>   Leann Rimes
>
> This is the camel-case case that can't currently be handled by a single WordDelimiterFilter.
>
> If the indexed doc had LeAnn, then it would be indexed as le,ann/leann and hence queries of both forms "le ann" and "leann" would match.
>
> However, since the indexed term is simply leann, a WordDelimiterFilter configured to split won't match (a search for LeAnn will be translated into a search for "le ann").

but the concatparts and/or concatall should handle splicing the tokens back together, right?

> One way to work around this now is to do a copyField into another field that catenates split terms in the query analyzer instead of generating/splitting, and then search across both fields.

yeah, unfortunately, that's not an option for me :)

> BTW, your parsed query below shows you turned on both catenation and generation (or perhaps preserveOriginal) for split subwords in your query analyzer. Unfortunately this configuration doesn't work due to the ambiguity of what it means to have multiple terms at the same position (this is the same problem as for multi-word synonyms at query time). The query shown below looks for "leann" or "le" followed by "ann", and hence an indexed term of "leann" won't match.

ugh. ok, thanks for letting me know.

I'm not using the same concat parameters on the index as the query, based on the solr wiki docs. but I've always wondered if that was a good idea. I'll see if matching them up helps at all.

thanks. I'll let you know what I find.

--Geoff
Re: camel-casing and dismax troubles
On Wed, May 13, 2009 at 12:29 PM, Geoffrey Young ge...@modperlcookbook.org wrote:
>> However, since the indexed term is simply leann, a WordDelimiterFilter configured to split won't match (a search for LeAnn will be translated into a search for "le ann").
>
> but the concatparts and/or concatall should handle splicing the tokens back together, right?

Yes, but you can't do both at once on the query side (split and concat)... you have to pick one or the other (hence the workaround of using more than one field).

-Yonik
http://www.lucidimagination.com
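
As an illustration of "using more than one field", a dismax handler in solrconfig.xml could list both fields in its qf parameter, so each query term is tried against the splitting field and the catenating field. This is a sketch, not a tested config: the handler name is arbitrary, and search-en-cat is a hypothetical second field whose query analyzer catenates subwords instead of splitting them.

```xml
<!-- Sketch only: assumes a hypothetical field "search-en-cat" exists alongside
     search-en and catenates split subwords in its query analyzer. -->
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Each query term becomes a DisjunctionMaxQuery across both fields,
         so a match in either field satisfies that term. -->
    <str name="qf">search-en search-en-cat</str>
  </lst>
</requestHandler>
```

Because each field applies its own query analyzer, LeAnn would be searched as "le ann" in search-en and as leann in search-en-cat, which sidesteps having split and catenated terms at the same position in a single field.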
camel-casing and dismax troubles
hi all :)

I'm having trouble with camel-cased query strings and the dismax handler.

a user query

  LeAnn Rimes

isn't matching the indexed term

  Leann Rimes

even though both are lower-cased in the end. furthermore, the analysis tool shows a match.

the debug query looks like

  "parsedquery":"+((DisjunctionMaxQuery((search-en:\"(leann le) ann\")) DisjunctionMaxQuery((search-en:rimes)))~2) ()",

I have a feeling it's due to how the broken up tokens are added back into the token stream with PreserveOriginal, and some strange interaction between that order and dismax, but I'm not entirely sure.

configs follow. thoughts appreciated.

--Geoff

<fieldType name="search-en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords-en.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords-en.txt"/>
  </analyzer>
</fieldType>