Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

Jack Krupansky Sat, 17 May 2014 11:44:28 -0700

Your bad experience seems to have occurred because you chose to use all default values for the WDF attributes. In particular, the generateWordParts and generateNumberParts attributes default to "1" (true), resulting in the discrete "abc", "123", and "xyz" tokens, and the catenateAll attribute defaults to "0" (false), which means that the "abc123xyz" token is not generated by that attribute, although "abc123xyz" is generated because you explicitly specified the preserveOriginal attribute to be "1".

Generally, you need to have asymmetric WDF analyzers, one for indexing that generates multiple terms for better recall, and one for query that generates only a sequence of the sub-terms (as if a quoted phrase) for more precise matching. So, it's fine to use preserveOriginal="1" for indexing, as well as catenateAll="1" and generateNumberParts="1" and generateWordParts="1", but for query analysis you should have preserveOriginal="0", catenateAll="0" and catenateWords="0" and catenateNumbers="0" and generateNumberParts="1" and generateWordParts="1".
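As a rough sketch of the asymmetric setup described above, the field type could look something like the following. The name "wdTextAsym" is hypothetical, the tokenizer and lowercase filter are simply carried over from the field type in the original question, and the catenation attributes are written with the filter's catenateWords and catenateNumbers names. This has not been tested against 4.8, so treat it as a starting point rather than a verified configuration.

```xml
<!-- Hypothetical field type name; attribute values follow the advice above. -->
<fieldType name="wdTextAsym" class="solr.TextField" positionIncrementGap="100">
  <!-- Index side: emit several variants of each term for better recall. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query side: emit only the sequence of sub-terms for more precise matching. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the index-side analyzer requires reindexing before the new tokens show up, and the Analysis screen in the Solr admin UI is a convenient way to compare what each side produces for a sample such as "abc123XYZ".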

The distinction between preserveOriginal and catenateAll is whether punctuation should be included (for the former) or stripped out (the latter):


abc. => abc. vs. abc

(xyz). => (xyz). vs. xyz

401(k). => 401(k). vs. 401k

CD-ROM. => CD-ROM. vs. CDROM

Finally, the default for the splitOnNumerics attribute is "1" (true), which is why "abc123xyz" is split into three terms. If you don't want that split, set splitOnNumerics="0".
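For illustration only, here is the filter line from the original wdText definition with that one attribute added; everything else is left at the values already in the question. With this setting, "abc123xyz" would be expected to stay a single term instead of being split at the letter/digit boundaries, though it is worth confirming on the Analysis screen.

```xml
<!-- Same filter as in the question, with splitting on letter/digit transitions turned off. -->
<filter class="solr.WordDelimiterFilterFactory"
        preserveOriginal="1" splitOnNumerics="0"/>
```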


There are more details on WDF in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html


-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch

Sent: Saturday, May 17, 2014 1:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

My understanding was that lower-casing and other such steps happen on a
per-field basis, as a step after the dismax formula is applied. In
this case, however, this seems to be happening before:
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz)

Hence the question to someone who actually understands those guts. For
eDisMax, what's the correct/expected call sequence between query
parser and field-type parser? Or maybe just a slightly more in-depth
explanation of Michael's statement.

Regards,
  Alex.

Personal website: http://www.outerthoughts.com/

Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency



On Sat, May 17, 2014 at 8:28 PM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:

Alex - the query parsers generally accept an analyzer, which they must apply
after they perform their own tokenization.  Consider: how would a
capitalized query term match lower-cased terms in the index without query
analysis?

-Mike


On 5/17/2014 4:05 AM, Alexandre Rafalovitch wrote:


Hello,

I am getting weird results that seem to come from eDisMax using the
analyzer chain to break the input text. I have
WordDelimiterFilterFactory in my chain, which does a lot of
interesting things I did not expect the query parser to be involved in.

Specifically, the string "abc123XYZ" gets split into 3 components on
digits and gets lowercased as well. I thought all that was happening
later, inside individual fields.

All documentation talks about query parsers splitting on space, so I
don't know where this "full chain" business is coming from. Or maybe I
am misunderstanding which phase debug output is from.

Here is the field definition:
     <fieldType name="wdText" class="solr.TextField" >
         <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
             <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
             <filter class="solr.LowerCaseFilterFactory" />
         </analyzer>
     </fieldType>
     <fieldType name="wsText" class="solr.TextField" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       </analyzer>
     </fieldType>

     <field name="wdText"      type="wdText" indexed="true" stored="true" />
     <field name="wsText"      type="wsText" indexed="true" stored="true" />

And here is the debug output:

http://localhost:9000/solr/collection1/select?q=hello+big+world+abc123XYZ&wt=json&indent=true&debugQuery=true&defType=edismax&qf=wdText+wsText&stopwords=true&lowercaseOperators=true

    "rawquerystring":"hello big world abc123XYZ",
     "querystring":"hello big world abc123XYZ",
     "parsedquery":"(+(DisjunctionMaxQuery((wdText:hello |
wsText:hello)) DisjunctionMaxQuery((wdText:big | wsText:big))
DisjunctionMaxQuery((wdText:world | wsText:world))
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123
wdText:xyz) | wsText:abc123XYZ))))/no_coord",
     "parsedquery_toString":"+((wdText:hello | wsText:hello)
(wdText:big | wsText:big) (wdText:world | wsText:world)
(((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) |
wsText:abc123XYZ))",

Oh, and enabling phrase search on the field type gets even
weirder. But one problem at a time.

Regards,
    Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency
