On 9/2/14 1:51 PM, Erick Erickson wrote:
bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
not "macbook"

I suspect your query parameters for WordDelimiterFilterFactory doesn't have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?

Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both index and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or anywhere!), so I'll give you screenshots instead, sorry. They cover our whole real-world app's complex field definition, which I'll also paste in full below. But I realize my next step is probably creating a simpler isolation/reproduction case (unless you have a magic answer from this!).
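
For that isolation case, a stripped-down field type along these lines might be a starting point. This is only a sketch: the name text_wdf_repro is hypothetical, and swapping in WhitespaceTokenizerFactory and LowerCaseFilterFactory for our ICU tokenizer and folding assumes the problem lies in WordDelimiterFilterFactory itself rather than in the ICU, synonym, or stemming steps. The WordDelimiterFilterFactory settings are copied unchanged from our real field.

```xml
<!-- hypothetical reproduction field type; not part of our real schema -->
<fieldType name="text_wdf_repro" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- assumption: plain whitespace tokenizing stands in for ICUTokenizerFactory with the custom rbbi rules -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- same settings as the real field definition -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- assumption: lowercasing alone stands in for ICUFoldingFilterFactory -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If "MacBook" behaves the same way against this type, the ICU, synonym, and stemming filters are probably not involved. If it behaves correctly, adding those filters back one at a time should show which one changes the result.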

Again, the problem is that "MacBook" seems to be only matching on indexed "macbook" and not indexed "mac book".


"MacBook" query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

"MacBook" index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

"mac book" index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
             so our synonym filter involving C++ etc can still work.
             From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
             the rbbi file is in our local ./conf, copied from lucene source tree -->
        <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>


        <!-- folding needs to be after WordDelimiter, so WordDelimiter
             can do its thing with full cases and such -->
        <filter class="solr.ICUFoldingFilterFactory" />


        <!-- ICUFolding already includes lowercasing, no
             need for separate lowercasing step
        <filter class="solr.LowerCaseFilterFactory"/>
        -->

        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>



