Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind Wed, 03 Sep 2014 08:49:47 -0700

Thanks Erick and Diego. Yes, I noticed in my last message that I'm not actually using the defaults; I'm not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case. I'm getting confusing results that suggest some other part of my field definition may be pertinent.

I'll come back when I've done that (hopefully next week), and include the _parsed_ query from &debug=query then. Thanks!


Jonathan


On 9/2/14 4:26 PM, Erick Erickson wrote:

What happens if you append &debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords and
catenateNumbers are 1 in the index portion and 0 in the query section. Still,
this shouldn't be a problem, all other things being equal.
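
For illustration, here is a minimal sketch of separate index-time and query-time analyzer sections whose catenate settings differ in the way described above. The fieldType name "text_wdf_example" and the choice of WhitespaceTokenizerFactory and LowerCaseFilterFactory are assumptions made for the example; they are not taken from the schema discussed in this thread.

```xml
<!-- Hypothetical fieldType: the name and the tokenizer are assumptions for illustration only -->
<fieldType name="text_wdf_example" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: emit the split parts and also the catenated form -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: emit only the split parts, with no catenated token -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

When a fieldType has a single <analyzer> element with no type attribute, the same chain is applied at both index and query time, so the catenate settings are necessarily identical in both phases.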

Best,
Erick


On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

On 9/2/14 1:51 PM, Erick Erickson wrote:

bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
not "macbook"

I suspect your query parameters for WordDelimiterFilterFactory don't have
catenate words set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?


Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both index
and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email (or
anywhere!), I'll give you screenshots, sorry -- and I'll give them for our
whole real world app complex field definition. I'll also paste in our
entire field definition below. But I realize my next step is probably
creating a simpler isolation/reproduction case (unless you have a magic
answer from this!).

Again, the problem is that "MacBook" seems to be only matching on indexed
"macbook" and not indexed "mac book".


"MacBook" query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

"MacBook" index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

"mac book" index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <!-- the rulefiles thing is to keep ICUTokenizerFactory from stripping punctuation,
             so our synonym filter involving C++ etc can still work.
             From: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
             the rbbi file is in our local ./conf, copied from lucene source tree -->
        <tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

        <filter class="solr.SynonymFilterFactory" synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

        <!-- folding needs to be after WordDelimiter, so WordDelimiter
             can do its thing with full cases and such -->
        <filter class="solr.ICUFoldingFilterFactory"/>

        <!-- ICUFolding already includes lowercasing, no
             need for separate lowercasing step
        <filter class="solr.LowerCaseFilterFactory"/>
        -->

        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
