Re: WordDelimiter filter, expanding to multiple words, unexpected results

Diego Fernandez Tue, 02 Sep 2014 15:01:49 -0700

This is not a solution, but it may help narrow down the problem.
http://solr.pl/en/2010/08/16/what-is-schema-xml/ says:


"It is worth noting that there is an additional attribute for the text field 
type:

    autoGeneratePhraseQueries

This attribute is responsible for telling filters how to behave when dividing 
tokens. Some filters (such as WordDelimiterFilter) can divide tokens into a set 
of tokens. Setting the attribute to true (default value) will automatically 
generate phrase queries. This means that WordDelimiterFilter will divide the 
word “wi-fi” into two tokens “wi” and “fi”. With autoGeneratePhraseQueries set 
to true query sent to Lucene will look like "field:wi fi", while with set to 
false Lucene query will look like field:wi OR field:fi. However, please note, 
that this attribute only behaves well with tokenizers based on white spaces."

Since phrase queries match on token positions, it is possible that the 
positions assigned to the other generated tokens have something to do with it.  
Have you tried setting autoGeneratePhraseQueries="false" to see if it then 
matches both? (I know that might have other unintended effects, but it might 
give some insight into the problem.)
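
For the test, I mean something like the following. It is only a sketch: it is 
your own field definition from the message below with the one attribute 
flipped and the comments dropped, and I have not tried it against your index. 
I am assuming a core reload is enough to pick up the change.

```xml
<!-- Diagnostic sketch only: the "text" type from the original message,
     unchanged except autoGeneratePhraseQueries="false". -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Running the "MacBook" query with debugQuery=true before and after the change 
should show whether the parsed query switches from a phrase query to separate 
term clauses.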

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



----- Original Message -----
> On 9/2/14 1:51 PM, Erick Erickson wrote:
> > bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
> > not "macbook"
> >
> > I suspect your query parameters for WordDelimiterFilterFactory don't have
> > catenate words set.
> >
> > What do you see when you enter these in both the index and query portions
> > of the admin/analysis page?
> 
> Thanks Erick!
> 
> Our WordDelimiterFilterFactory does have catenate words set, in both
> index and query phases (is that right?):
> 
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> 
> It's hard to cut and paste the results of the analysis page into email
> (or anywhere!), I'll give you screenshots, sorry -- and I'll give them
> for our whole real world app complex field definition. I'll also paste
> in our entire field definition below. But I realize my next step is
> probably creating a simpler isolation/reproduction case (unless you have
> a magic answer from this!).
> 
> Again, the problem is that "MacBook" seems to be only matching on
> indexed "macbook" and not indexed "mac book".
> 
> 
> "MacBook" query analysis:
> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
> 
> "MacBook" index analysis:
> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
> 
> "mac book" index analysis:
> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
> 
> 
> Our entire actual field definition:
> 
>    <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>        <analyzer>
>         <!-- the rulefiles thing is to keep ICUTokenizerFactory from
> stripping punctuation,
>              so our synonym filter involving C++ etc can still work.
>              From:
> https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3c51965e70.6070...@elyograg.org%3E
>              the rbbi file is in our local ./conf, copied from lucene
> source tree -->
>         <tokenizer class="solr.ICUTokenizerFactory"
> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
> 
>         <filter class="solr.SynonymFilterFactory"
> synonyms="punctuation-whitelist.txt" ignoreCase="true"/>
> 
>          <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
> 
>          <!-- folding needs to be after WordDelimiter, so WordDelimiter
>               can do its thing with full cases and such -->
>          <filter class="solr.ICUFoldingFilterFactory" />
> 
> 
>          <!-- ICUFolding already includes lowercasing, no
>               need for separate lowercasing step
>          <filter class="solr.LowerCaseFilterFactory"/>
>          -->
> 
>          <filter class="solr.SnowballPorterFilterFactory"
> language="English" protected="protwords.txt"/>
>          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>        </analyzer>
>      </fieldType>
> 
> 
> 
> 
>
