Your bad experience seems to have occurred because you left all of the other WDF (WordDelimiterFilter) attributes at their default values. In particular, generateWordParts and generateNumberParts default to "1" (true), which produces the discrete "abc", "123", and "xyz" tokens. catenateAll defaults to "0" (false), so that attribute does not generate an "abc123xyz" token; the "abc123xyz" token you see is there only because you explicitly set preserveOriginal to "1".

Generally, you need asymmetric WDF analyzers: one for indexing that generates multiple terms for better recall, and one for query that generates only a sequence of the sub-terms (as if a quoted phrase) for more precise matching. So, for indexing it's fine to use preserveOriginal="1" along with catenateAll="1", generateNumberParts="1" and generateWordParts="1", but for query analysis you should have preserveOriginal="0", catenateAll="0", catenateWords="0", catenateNumbers="0", generateNumberParts="1" and generateWordParts="1".
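
As a rough sketch of that asymmetric setup, assuming the wdText field type from the original message below: the tokenizer and lowercase filter are copied from that definition, the attribute values are the ones suggested above, and catenateWords and catenateNumbers are the names WordDelimiterFilterFactory uses for the word-part and number-part catenation switches. Treat it as a starting point rather than a tuned configuration.

```xml
<fieldType name="wdText" class="solr.TextField">
  <!-- Index time: emit the original token, the sub-terms and the
       fully catenated form, for better recall. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: emit only the sequence of sub-terms. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="0" catenateAll="0"
            catenateWords="0" catenateNumbers="0"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Any change to the index-time analyzer only takes effect for documents indexed after the change, so existing documents need to be reindexed.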

The distinction between preserveOriginal and catenateAll is whether punctuation should be included (for the former) or stripped out (the latter):

abc. => abc. vs. abc

(xyz). => (xyz). vs. xyz

401(k). => 401(k). vs. 401k

CD-ROM. => CD-ROM. vs. CDROM

Finally, the default for the splitOnNumerics attribute is "1" (true), which is why "abc123xyz" is split into three terms. If you don't want that split, set splitOnNumerics="0".
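
For illustration, a minimal sketch of the filter line from the original wdText definition with only that one attribute added (everything else is left at its default, which is an assumption about what you want):

```xml
<!-- splitOnNumerics="0": do not split at letter/digit transitions -->
<filter class="solr.WordDelimiterFilterFactory"
        preserveOriginal="1" splitOnNumerics="0"/>
```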

There are more details on WDF in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html


-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Saturday, May 17, 2014 1:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.8: Does eDisMax parser calls analyzer chain to tokenize?

My understanding was that lower-casing and other analysis happen on a
per-field basis, as a step after the dismax formula is applied. In
this case, however, it seems to be happening before:
DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz)

Hence the question to someone who actually understands those guts. For
eDisMax, what's the correct/expected call sequence between query
parser and field-type parser? Or maybe just a slightly more in-depth
explanation of Michael's statement.

Regards,
  Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Sat, May 17, 2014 at 8:28 PM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
Alex - the query parsers generally accept an analyzer, which they must apply
after they perform their own tokenization.  Consider: how would a
capitalized query term match lower-cased terms in the index without query
analysis?

-Mike


On 5/17/2014 4:05 AM, Alexandre Rafalovitch wrote:

Hello,

I am getting weird results that seem to come from eDisMax using the
analyzer chain to break up the input text. I have
WordDelimiterFilterFactory in my chain, which does a lot of
interesting things I did not expect the query parser to be involved in.

Specifically, the string "abc123XYZ" gets split into 3 components on
digits and gets lowercased as well. I thought all that was happening
later, inside individual fields.

All the documentation talks about query parsers splitting on whitespace,
so I don't know where this "full chain" business is coming from. Or maybe
I am misunderstanding which phase the debug output is from.

Here is the field definition:
     <fieldType name="wdText" class="solr.TextField">
         <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
             <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
             <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
     </fieldType>
     <fieldType name="wsText" class="solr.TextField" positionIncrementGap="100">
         <analyzer>
             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         </analyzer>
     </fieldType>

     <field name="wdText" type="wdText" indexed="true" stored="true"/>
     <field name="wsText" type="wsText" indexed="true" stored="true"/>

And here is the debug output:

http://localhost:9000/solr/collection1/select?q=hello+big+world+abc123XYZ&wt=json&indent=true&debugQuery=true&defType=edismax&qf=wdText+wsText&stopwords=true&lowercaseOperators=true

     "rawquerystring":"hello big world abc123XYZ",
     "querystring":"hello big world abc123XYZ",
     "parsedquery":"(+(DisjunctionMaxQuery((wdText:hello | wsText:hello)) DisjunctionMaxQuery((wdText:big | wsText:big)) DisjunctionMaxQuery((wdText:world | wsText:world)) DisjunctionMaxQuery((((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) | wsText:abc123XYZ))))/no_coord",
     "parsedquery_toString":"+((wdText:hello | wsText:hello) (wdText:big | wsText:big) (wdText:world | wsText:world) (((wdText:abc123xyz wdText:abc) wdText:123 wdText:xyz) | wsText:abc123XYZ))",

Oh, and enabling phrase search on the field type gets even
weirder. But one problem at a time.

Regards,
    Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


