Re: camel-casing and dismax troubles

2009-05-13 Thread Yonik Seeley
On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young
ge...@modperlcookbook.org wrote:
 hi all :)

 I'm having trouble with camel-cased query strings and the dismax handler.

 a user query

  LeAnn Rimes

 isn't matching the indexed term

  Leann Rimes

This is the camel-case case that can't currently be handled by a
single WordDelimiterFilter.

If the indexed doc had "LeAnn", then it would be indexed as
"le","ann"/"leann", and hence queries of both forms, "le ann" and
"leann", would match.

However, since the indexed term is simply "leann", a
WordDelimiterFilter configured to split won't match (a search for
"LeAnn" will be translated into a search for "le ann").
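
The mismatch described above can be reproduced with a toy model. The Python sketch below is not Solr's WordDelimiterFilter; split_on_case_change, index_terms, query_terms and matches are hypothetical helpers that only mimic splitting on a lower-to-upper case change, lower-casing, and (at index time only) adding the catenated form.

```python
import re


def split_on_case_change(token):
    # Insert a break at each lower-to-upper transition: "LeAnn" -> ["Le", "Ann"].
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", token).split()


def index_terms(token):
    # Assumed index-time behaviour: emit the lower-cased parts and, when the
    # token was actually split, the catenated form as well.
    parts = [p.lower() for p in split_on_case_change(token)]
    terms = set(parts)
    if len(parts) > 1:
        terms.add("".join(parts))
    return terms


def query_terms(token):
    # Assumed query-time behaviour: split only, no catenation.
    return [p.lower() for p in split_on_case_change(token)]


def matches(query_token, indexed_token):
    # Every query term must be present among the indexed terms.
    indexed = index_terms(indexed_token)
    return all(term in indexed for term in query_terms(query_token))


print(sorted(index_terms("LeAnn")))  # ['ann', 'le', 'leann']
print(sorted(index_terms("Leann")))  # ['leann']
print(matches("LeAnn", "LeAnn"))     # True
print(matches("LeAnn", "Leann"))     # False: "le" and "ann" were never indexed
```

In this model a query for "Leann" still finds a document containing "LeAnn", because the catenated form was added at index time; only the reverse direction fails.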

One way to work around this now is to do a copyField into another
field that catenates split terms in the query analyzer instead of
generating/splitting, and then search across both fields.

BTW, your parsed query below shows you turned on both catenation and
generation (or perhaps preserveOriginal) for split subwords in your
query analyzer.  Unfortunately this configuration doesn't work due to
the ambiguity of what it means to have multiple terms at the same
position (this is the same problem for multi-word synonyms at query
time).  The query shown below looks for leann or le followed by
ann and hence an indexed term of leann won't match.
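
A rough way to see the position problem is to model the parsed query as a phrase whose first slot accepts either "leann" or "le" and whose second slot requires "ann". The Python sketch below is only an illustration: phrase_matches is a hypothetical helper, and the token positions assigned to the two documents are assumptions, not output from Solr's analysis chain.

```python
def phrase_matches(query_slots, doc_positions):
    # query_slots: one set of alternative terms per consecutive query position.
    # doc_positions: maps a position in the document to the terms indexed there.
    for start in doc_positions:
        if all(
            alternatives & doc_positions.get(start + offset, set())
            for offset, alternatives in enumerate(query_slots)
        ):
            return True
    return False


# The query from the debug output: ("leann" OR "le") followed by "ann".
query = [{"leann", "le"}, {"ann"}]

# Assumed positions for a document indexed from "Leann Rimes".
doc_leann = {0: {"leann"}, 1: {"rimes"}}

# Assumed positions for a document indexed from "LeAnn Rimes".
doc_split = {0: {"le", "leann"}, 1: {"ann"}, 2: {"rimes"}}

print(phrase_matches(query, doc_leann))  # False: nothing named "ann" follows
print(phrase_matches(query, doc_split))  # True
```

The first document fails even though "leann" is present, because the second slot of the phrase still demands "ann" at the next position.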

-Yonik
http://www.lucidimagination.com

 even though both are lower-cased in the end.  furthermore, the
 analysis tool shows a match.

 the debug query looks like

  "parsedquery":"+((DisjunctionMaxQuery((search-en:\"(leann le) ann\")) DisjunctionMaxQuery((search-en:rimes)))~2) ()",

 I have a feeling it's due to how the broken up tokens are added back
 into the token stream with PreserveOriginal, and some strange
 interaction between that order and dismax, but I'm not entirely sure.

 configs follow.  thoughts appreciated.

 --Geoff

   <fieldType name="search-en" class="solr.TextField"
              positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.WordDelimiterFilterFactory"
               preserveOriginal="1"
               generateWordParts="1"
               generateNumberParts="1"
               catenateWords="1"
               catenateNumbers="1"
               catenateAll="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SynonymFilterFactory"
               synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="false"
               words="stopwords-en.txt"/>
     </analyzer>

     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.ISOLatin1AccentFilterFactory"/>
       <filter class="solr.WordDelimiterFilterFactory"
               preserveOriginal="1"
               generateWordParts="1"
               generateNumberParts="1"
               catenateWords="0"
               catenateNumbers="0"
               catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="false"
               words="stopwords-en.txt"/>
     </analyzer>
   </fieldType>



Re: camel-casing and dismax troubles

2009-05-13 Thread Geoffrey Young
On Wed, May 13, 2009 at 6:23 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young
 ge...@modperlcookbook.org wrote:
 hi all :)

 I'm having trouble with camel-cased query strings and the dismax handler.

 a user query

  LeAnn Rimes

 isn't matching the indexed term

  Leann Rimes

 This is the camel-case case that can't currently be handled by a
 single WordDelimiterFilter.

 If the indexed doc had "LeAnn", then it would be indexed as
 "le","ann"/"leann", and hence queries of both forms, "le ann" and
 "leann", would match.

 However, since the indexed term is simply "leann", a
 WordDelimiterFilter configured to split won't match (a search for
 "LeAnn" will be translated into a search for "le ann").

but the concatparts and/or concatall should handle splicing the tokens
back together, right?


 One way to work around this now is to do a copyField into another
 field that catenates split terms in the query analyzer instead of
 generating/splitting, and then search across both fields.

yeah, unfortunately, that's not an option for me :)


 BTW, your parsed query below shows you turned on both catenation and
 generation (or perhaps preserveOriginal) for split subwords in your
 query analyzer.  Unfortunately this configuration doesn't work due to
 the ambiguity of what it means to have multiple terms at the same
 position (this is the same problem for multi-word synonyms at query
 time).  The query shown below looks for leann or le followed by
 ann and hence an indexed term of leann won't match.

ugh.  ok, thanks for letting me know.

following the solr wiki docs, I'm not using the same concat parameters
at index time as at query time.  but I've always wondered if that was
a good idea.  I'll see if matching them up helps at all.

thanks.  I'll let you know what I find.

--Geoff


Re: camel-casing and dismax troubles

2009-05-13 Thread Yonik Seeley
On Wed, May 13, 2009 at 12:29 PM, Geoffrey Young
ge...@modperlcookbook.org wrote:
 However, since the indexed term is simply "leann", a
 WordDelimiterFilter configured to split won't match (a search for
 "LeAnn" will be translated into a search for "le ann").

 but the concatparts and/or concatall should handle splicing the tokens
 back together, right?

Yes, but you can't do both at once on the query side (split and
concat)... you have to pick one or the other (hence the workaround of
using more than one field).
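
The two-field workaround can be sketched with the same kind of toy model. This is Python, not Solr configuration, and every name in it is hypothetical: both fields are assumed to hold the same indexed terms (as a copyField would give), one field is queried with split terms and the other with the catenated term, and a document is a hit if either field matches.

```python
import re


def split_parts(token):
    # Lower-cased parts after breaking at lower-to-upper transitions.
    return [p.lower() for p in re.sub(r"([a-z])([A-Z])", r"\1 \2", token).split()]


def analyze_index(token):
    # Assumed index-time terms, shared by both fields: parts plus catenation.
    parts = split_parts(token)
    return set(parts) | {"".join(parts)}


def field_hits(query_token, indexed_token):
    indexed = analyze_index(indexed_token)
    # Field 1: the query analyzer only splits.
    split_hit = all(t in indexed for t in split_parts(query_token))
    # Field 2: the query analyzer only catenates.
    cat_hit = "".join(split_parts(query_token)) in indexed
    return split_hit, cat_hit


def search(query_token, indexed_token):
    # Searching across both fields: a hit in either one is enough.
    return any(field_hits(query_token, indexed_token))


print(field_hits("LeAnn", "Leann"))  # (False, True)
print(search("LeAnn", "Leann"))      # True
print(search("Leann", "LeAnn"))      # True
```

Each query analyzer here does exactly one thing, so neither field ever sees the ambiguous mix of split and catenated terms at the same position.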

-Yonik
http://www.lucidimagination.com


camel-casing and dismax troubles

2009-05-12 Thread Geoffrey Young
hi all :)

I'm having trouble with camel-cased query strings and the dismax handler.

a user query

 LeAnn Rimes

isn't matching the indexed term

 Leann Rimes

even though both are lower-cased in the end.  furthermore, the
analysis tool shows a match.

the debug query looks like

 "parsedquery":"+((DisjunctionMaxQuery((search-en:\"(leann le) ann\")) DisjunctionMaxQuery((search-en:rimes)))~2) ()",

I have a feeling it's due to how the broken up tokens are added back
into the token stream with PreserveOriginal, and some strange
interaction between that order and dismax, but I'm not entirely sure.

configs follow.  thoughts appreciated.

--Geoff

  <fieldType name="search-en" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              preserveOriginal="1"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="1"
              catenateNumbers="1"
              catenateAll="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false"
              words="stopwords-en.txt"/>
    </analyzer>

    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              preserveOriginal="1"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="0"
              catenateNumbers="0"
              catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="false"
              words="stopwords-en.txt"/>
    </analyzer>
  </fieldType>