On 9/2/14 1:51 PM, Erick Erickson wrote:
bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
not "macbook"
I suspect your query parameters for WordDelimiterFilterFactory don't have
catenate words set.
What do you see when you enter these in both the index and query portions
of the admin/analysis page?
Thanks Erick!
Our WordDelimiterFilterFactory does have catenate words set, in both
index and query phases (is that right?):
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
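For context, that filter sits inside a single <analyzer> element, so the same settings apply to both phases. If the two phases ever needed different settings, I understand the definition could be split roughly as in the sketch below. The field type name, the whitespace tokenizer, and the choice of catenateWords="0" on the query side are all assumptions for illustration; this is not our actual config and I have not tested it:

```xml
<!-- Hypothetical sketch: "text_split_example" is a made-up name, and the
     query-side settings are an assumption, not a tested recommendation. -->
<fieldType name="text_split_example" class="solr.TextField"
           positionIncrementGap="100">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied when queries are parsed; catenation turned off here -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```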
The analysis page output is hard to cut and paste into email (or
anywhere!), so I'll give you screenshots instead, sorry. They are for
our whole real-world app's complex field definition, which I'll also
paste in full below. I realize my next step is probably creating a
simpler isolation/reproduction case (unless you have a magic answer
from this!).
Again, the problem is that "MacBook" seems to be only matching on
indexed "macbook" and not indexed "mac book".
"MacBook" query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
"MacBook" index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
"mac book" index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
Our entire actual field definition:
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<!-- the rulefiles thing is to keep ICUTokenizerFactory from
stripping punctuation,
so our synonym filter involving C++ etc can still work.
From:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
the rbbi file is in our local ./conf, copied from lucene
source tree -->
<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
<filter class="solr.SynonymFilterFactory"
synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<!-- folding needs to be after WordDelimiter, so WordDelimiter
can do its thing with full cases and such -->
<filter class="solr.ICUFoldingFilterFactory" />
<!-- ICUFolding already includes lowercasing, no
need for separate lowercasing step
<filter class="solr.LowerCaseFilterFactory"/>
-->
<filter class="solr.SnowballPorterFilterFactory"
language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
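
As a starting point for the simpler isolation case mentioned above, a stripped-down field type along these lines might do. It is only a sketch: the name text_wdf_repro is made up, and I'm assuming a plain whitespace tokenizer and LowerCaseFilterFactory can stand in for the ICU tokenizer and ICU folding without changing the behavior in question, which may not be true.

```xml
<!-- Hypothetical reduced test case: "text_wdf_repro" is a made-up name.
     Synonyms, ICU folding, stemming and duplicate removal are left out
     on purpose, to see whether WordDelimiter alone shows the problem. -->
<fieldType name="text_wdf_repro" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <!-- Assumption: whitespace tokenizing is close enough to the
         break-only-on-whitespace ICU rules for this test -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same WordDelimiter settings as the real field definition -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <!-- Lowercasing after WordDelimiter, so it still sees the case change -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Indexing one document containing "macbook" and another containing "mac book" into a field of this type, then querying for "MacBook", should show whether the problem reproduces without the synonym, folding, and stemming steps.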