DisMax and WordDelimiterFilterFactory

Demian Katz Tue, 25 Oct 2011 10:13:49 -0700

I've seen a couple of threads related to this subject (for example, 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg33400.html), but I 
haven't found an answer that addresses the aspect of the problem that concerns 
me...


I have a field type set up like this:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The important feature here is the use of WordDelimiterFilterFactory, which 
allows a search for "WiFi" to match an indexed term of "wi fi" (for example).

The problem, of course, is that if a user accidentally introduces a case change 
in their query, the query analyzer chain breaks it into multiple words and no 
hits are found...  so a search for "exaMple" will look for "exa mple" and fail.

I've found two solutions that resolve this problem in the admin panel field 
analysis tool:


1.)    Turn on catenateWords and catenateNumbers in the query analyzer - this 
reassembles the user's broken word and allows a match.

2.)    Turn on preserveOriginal in the query analyzer - this passes through the 
user's original query, which then gets cleaned up by the ICUFoldingFilterFactory 
and allows a match.
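
As a rough sketch, the two variants of the query-side filter line would look 
something like this, assuming the rest of the query analyzer stays exactly as 
in the field type above (only the catenate* and preserveOriginal attributes 
differ):

    <!-- Option 1: re-join the split parts at query time -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>

    <!-- Option 2: keep the original token alongside the split parts -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>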

The problem is that in my real-world application, which uses DisMax, neither of 
these solutions works.  It appears that even though (if I understand correctly) 
the WordDelimiterFilterFactory is returning ALTERNATIVE tokens, the DisMax 
handler is combining them in a way that inappropriately requires all of them 
to match...  for example, here's partial debugQuery output for the "exaMple" 
search using DisMax and solution #2 above:

    "parsedquery":"+DisjunctionMaxQuery((
        genre:\"(exampl exa) mple\"^300.0 | 
        title_new:\"(exampl exa) mple\"^100.0 | 
        topic:\"(exampl exa) mple\"^500.0 | 
        series:\"(exampl exa) mple\"^50.0 | 
        title_full_unstemmed:\"(example exa) mple\"^600.0 | 
        geographic:\"(exampl exa) mple\"^300.0 | 
        contents:\"(exampl exa) mple\"^10.0 | 
        fulltext_unstemmed:\"(example exa) mple\"^10.0 | 
        allfields_unstemmed:\"(example exa) mple\"^10.0 | 
        title_alt:\"(exampl exa) mple\"^200.0 | 
        series2:\"(exampl exa) mple\"^30.0 | 
        title_short:\"(exampl exa) mple\"^750.0 | 
        author:\"(example exa) mple\"^300.0 | 
        title:\"(exampl exa) mple\"^500.0 | 
        topic_unstemmed:\"(example exa) mple\"^550.0 | 
        allfields:\"(exampl exa) mple\" | 
        author_fuller:\"(example exa) mple\"^150.0 | 
        title_full:\"(exampl exa) mple\"^400.0 | 
        fulltext:\"(exampl exa) mple\")) ()",

Obviously, that is not what I want - ideally it would be something like 'exampl 
OR "exa mple"'.

I also read about the autoGeneratePhraseQueries setting, but that seems to take 
things way too far in the opposite direction - if I set that to false, then I 
get matches for any individual token; i.e. example OR exa OR mple - not good at 
all!
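
For reference, that setting is an attribute on the field type itself; a 
minimal sketch, assuming the same "text" type as above with both analyzers 
left unchanged:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="false">
      <!-- index and query analyzers as in the definition above -->
    </fieldType>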

I have a sinking suspicion that there is not an easy solution to my problem, 
but this seems to be a fairly basic need; splitOnCaseChange is a useful feature 
to have, but it's more valuable if it serves as an ALTERNATIVE search rather 
than a necessary query munge.  Any thoughts?

thanks,
Demian
