Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind Tue, 02 Sep 2014 12:44:17 -0700

On 9/2/14 1:51 PM, Erick Erickson wrote:

bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
not "macbook"


I suspect your query parameters for WordDelimiterFilterFactory don't have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?


Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both index and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
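For comparison, here is a minimal sketch of the other common arrangement, modeled on the pattern in Solr's stock example schema: separate index-time and query-time analyzers, with catenation enabled only at index time. The fieldType name `text_wdf_split` and the simplified tokenizer/filter chain are assumptions for illustration, not our production config, and whether this changes the matching here is untested.

```xml
<!-- Hypothetical field type, for illustration only -->
<fieldType name="text_wdf_split" class="solr.TextField" positionIncrementGap="100">
  <!-- index time: emit the word parts AND the catenated form -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- query time: emit only the word parts, no catenation -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```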

It's hard to cut and paste the results of the analysis page into email (or anywhere!), I'll give you screenshots, sorry -- and I'll give them for our whole real world app complex field definition. I'll also paste in our entire field definition below. But I realize my next step is probably creating a simpler isolation/reproduction case (unless you have a magic answer from this!).

Again, the problem is that "MacBook" seems to be only matching on indexed "macbook" and not indexed "mac book".
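The simpler reproduction case could start from something like this: a stripped-down field type that keeps only the WordDelimiterFilterFactory settings from above and drops ICU tokenization, synonyms, folding and stemming. The name `text_wdf_repro`, the whitespace tokenizer and the plain lowercasing step are placeholders, not part of our schema.

```xml
<!-- Hypothetical minimal field type for isolating WordDelimiter behavior -->
<fieldType name="text_wdf_repro" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- single analyzer, so the same chain runs at index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- lowercase after WordDelimiter so splitOnCaseChange still sees the original case -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Running "MacBook", "macbook" and "mac book" through the admin/analysis page for this type should help show whether the behavior comes from WordDelimiter alone or from its interaction with the other filters.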



"MacBook" query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

"MacBook" index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

"mac book" index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">

      <analyzer>

        <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
             so our synonym filter involving C++ etc can still work.
             From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
             the rbbi file is in our local ./conf, copied from lucene source tree -->
        <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>



        <!-- folding needs to be after WordDelimiter, so WordDelimiter
             can do its thing with full cases and such -->
        <filter class="solr.ICUFoldingFilterFactory" />


        <!-- ICUFolding already includes lowercasing, no
             need for separate lowercasing step
        <filter class="solr.LowerCaseFilterFactory"/>
        -->

        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
