Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jack Krupansky Mon, 29 Dec 2014 14:25:38 -0800

WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers, and they need to respect that distinction.
The index analyzer would index as you have indicated, indexing both the
unitary term and the multi-term phrase, while the query analyzer would NOT
split on case change. That way the query could be a unitary term (possibly
with mixed case, which would no longer split the term) or a two-word
phrase.
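
A minimal sketch of that split, assuming the field type from the quoted
message below with everything else unchanged. It is untested against
Solr 4.3, and the query-side attribute values (catenateWords="0",
splitOnCaseChange="0") are one reasonable choice rather than the only one:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- Index side: split on case change and also index the catenated form,
       so "MacBook" is indexed as "mac", "book" and "macbook". -->
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- Query side (assumed settings): no split on case change and no
       catenation, so a mixed-case query stays a single token. -->
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With settings along these lines, a query for "deLALAIN" should pass through
WDF as one token and be folded to "delalain", so it can match the indexed
term instead of becoming a phrase query.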


-- Jack Krupansky



On Mon, Dec 29, 2014 at 5:12 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Okay, some months later I've come back to this with an isolated
> reproduction case. Thanks very much for any advice or debugging help you
> can give.
>
> The WordDelimiter filter is making a mixed-case query NOT match the
> single-case source, when it ought to.
>
> I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
> sense to debug here, and I need to install and try to reproduce on a more
> recent version).
>
> I have an index that includes ONE document (deleted and reindexed after
> index change), with content in only one field ("text") other than 'id', and
> that content is one word: "delalain".
>
> My analysis (both index and query, I don't have different ones) for the
> 'text' field is simply:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.ICUTokenizerFactory" />
>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
>
>         <filter class="solr.ICUFoldingFilterFactory" />
>       </analyzer>
> </fieldType>
>
> I am querying simply with eg /select?defType=lucene&q=text%3Adelalain
>
> Querying for "delalain" finds this document, as expected. Querying for
> "DELALAIN" finds this document, as expected (note the
> ICUFoldingFilterFactory).
>
> However, querying for "deLALAIN" does not find this document, which is
> unexpected.
>
> INDEX analysis of the source, "delalain", ends up as this in the index,
> which seems pretty straightforward, so I'll only bother pasting in the
> final index analysis:
>
> ######
> text    delalain
> raw_bytes       [64 65 6c 61 6c 61 69 6e]
> position        1
> start   0
> end     8
> type    <ALPHANUM>
> script  Latin
> #######
>
>
>
>
> QUERY analysis of the problematic query, "deLALAIN", looks like this:
>
> #####
> ICUT    text    deLALAIN
>         raw_bytes       [64 65 4c 41 4c 41 49 4e]
>         start   0
>         end     8
>         type    <ALPHANUM>
>         script  Latin
>         position        1
>
>
> WDF     text    de      LALAIN  deLALAIN
>         raw_bytes       [64 65] [4c 41 4c 41 49 4e]     [64 65 4c 41 4c 41 49 4e]
>         start   0       2       0
>         end     2       8       8
>         type    <ALPHANUM>      <ALPHANUM>      <ALPHANUM>
>         position        1       2       2
>         script  Common  Common  Common
>
>
> ICUFF   text    de      lalain  delalain
>         raw_bytes       [64 65] [6c 61 6c 61 69 6e]     [64 65 6c 61 6c 61 69 6e]
>         position        1       2       2
>         start   0       2       0
>         end     2       8       8
>         type    <ALPHANUM>      <ALPHANUM>      <ALPHANUM>
>         script  Common  Common  Common
> #######
>
>
>
> It's obviously the WordDelimiterFilter that is messing things up -- but
> how/why, and is it a bug?
>
> It wants to search for both "de lalain" as a phrase and, alternately,
> "delalain" as one word -- that's the intended, supported purpose of the
> WDF with this configuration, right? And it should work?
>
> The problem is that it is not successfully matching "delalain" as one
> word -- so, how to figure out why not and what to do about it?
>
> Previously, Erick and Diego asked for the info from &debug=query, so here
> is that as well:
>
> ####
> <lst name="debug">
>   <str name="rawquerystring">text:deLALAIN</str>
>   <str name="querystring">text:deLALAIN</str>
>   <str name="parsedquery">MultiPhraseQuery(text:"de (lalain delalain)")</str>
>   <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
>   <str name="QParser">LuceneQParser</str>
> </lst>
> ####
>
> Hmm, that does not look quite right. If I interpret that correctly, it's
> looking for "de" followed by either "lalain" or "delalain". I.e., it would
> match "de delalain"? But that's not right at all.
>
> So, what's gone wrong? Something with WDF when configured with
> generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if
> it's a bug, one that might be fixed in a more recent Solr?)
>
> Thanks!
>
> Jonathan
>
>
>
>
> On 9/3/14 7:15 PM, Erick Erickson wrote:
>
>> Jonathan:
>>
>> If at all possible, delete your collection/data directory (the whole
>> directory, including data) between runs after you've changed
>> your schema (at least any of your analysis that pertains to indexing).
>> Mixing old and new schema definitions can add to the confusion!
>>
>> Good luck!
>> Erick
>>
>> On Wed, Sep 3, 2014 at 8:48 AM, Jonathan Rochkind <rochk...@jhu.edu>
>> wrote:
>>
>>> Thanks Erick and Diego. Yes, I noticed in my last message I'm not
>>> actually using defaults, not sure why I chose non-defaults originally.
>>>
>>> I still need to find time to make a smaller isolation/reproduction case,
>>> I'm getting confusing results that suggest some other part of my field
>>> def may be pertinent.
>>>
>>> I'll come back when I've done that (hopefully next week), and include the
>>> _parsed_ from &debug=query then. Thanks!
>>>
>>> Jonathan
>>>
>>>
>>>
>>> On 9/2/14 4:26 PM, Erick Erickson wrote:
>>>
>>>>
>>>> What happens if you append &debug=query to your query? IOW, what does
>>>> the _parsed_ query look like?
>>>>
>>>> Also note that the defaults for WDFF are _not_ identical. catenateWords
>>>> and catenateNumbers are 1 in the index portion and 0 in the query
>>>> section. Still, this shouldn't be a problem all other things being equal.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>
>>>> On Tue, Sep 2, 2014 at 12:43 PM, Jonathan Rochkind <rochk...@jhu.edu>
>>>> wrote:
>>>>
>>>>> On 9/2/14 1:51 PM, Erick Erickson wrote:
>>>>>
>>>>>> bq: In my actual index, query "MacBook" is matching ONLY "mac book",
>>>>>> and not "macbook"
>>>>>>
>>>>>> I suspect your query parameters for WordDelimiterFilterFactory don't
>>>>>> have catenate words set.
>>>>>>
>>>>>> What do you see when you enter these in both the index and query
>>>>>> portions of the admin/analysis page?
>>>>>>
>>>>>>
>>>>> Thanks Erick!
>>>>>
>>>>> Our WordDelimiterFilterFactory does have catenate words set, in both
>>>>> index and query phases (is that right?):
>>>>>
>>>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>>>> catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>> It's hard to cut and paste the results of the analysis page into email
>>>>> (or anywhere!), I'll give you screenshots, sorry -- and I'll give them
>>>>> for our whole real world app complex field definition. I'll also paste
>>>>> in our entire field definition below. But I realize my next step is
>>>>> probably creating a simpler isolation/reproduction case (unless you
>>>>> have a magic answer from this!).
>>>>>
>>>>> Again, the problem is that "MacBook" seems to be matching only indexed
>>>>> "mac book" and not indexed "macbook".
>>>>>
>>>>>
>>>>> "MacBook" query analysis:
>>>>> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>>>>>
>>>>> "MacBook" index analysis:
>>>>> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>>>>>
>>>>> "mac book" index analysis:
>>>>> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>>>>
>>>>>
>>>>> Our entire actual field definition:
>>>>>
>>>>>     <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100"
>>>>> autoGeneratePhraseQueries="true">
>>>>>         <analyzer>
>>>>>          <!-- the rulefiles thing is to keep ICUTokenizerFactory from
>>>>>               stripping punctuation, so our synonym filter involving
>>>>>               C++ etc can still work. From:
>>>>>               https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070...@elyograg.org%3E
>>>>>               the rbbi file is in our local ./conf, copied from lucene
>>>>>               source tree -->
>>>>>          <tokenizer class="solr.ICUTokenizerFactory"
>>>>> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>>>>
>>>>>          <filter class="solr.SynonymFilterFactory"
>>>>> synonyms="punctuation-whitelist.txt"
>>>>> ignoreCase="true"/>
>>>>>
>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>>
>>>>>           <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>>>>                can do its thing with full cases and such -->
>>>>>           <filter class="solr.ICUFoldingFilterFactory" />
>>>>>
>>>>>
>>>>>           <!-- ICUFolding already includes lowercasing, no
>>>>>                need for separate lowercasing step
>>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>>           -->
>>>>>
>>>>>           <filter class="solr.SnowballPorterFilterFactory"
>>>>> language="English" protected="protwords.txt"/>
>>>>>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>         </analyzer>
>>>>>       </fieldType>
>>>>>
