Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Erick Erickson Wed, 15 Jun 2011 12:47:47 -0700

Jonathan:

Thanks for writing that up, you're right, it is arcane....


I've starred this one!

Erick

>
> http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>
> So to understand, first familiarize yourself with that.
>
> However, none of the fields involved here had any stopwords at all, so at
> first it wasn't obvious this was the problem. But having different
> tokenization and other analysis between fields can result in exactly the
> same problem, for certain queries.
>
> One field in the dismax qf used an analyzer that stripped punctuation. (I'm
> actually not positive at this point _which_ analyzer in my chain was
> stripping punctuation, I'm using a bunch including some custom ones, but I
> was aware that punctuation was being stripped, this was intentional.)
>
> So "monkey's" turns into "monkey".  "monkey:" turns into "monkey".  So far
> so good. But what happens if you have punctuation all by itself, separated by
> whitespace?  "Roosevelt & Churchill" turns into ['roosevelt', 'churchill'].
> That ampersand in the middle was stripped out, essentially _just as if_ it
> were a stopword. Only two tokens result from that input.
>
> You can see where this is going -- another field involved in the dismax qf
> did NOT strip out punctuation. So three tokens result from that input,
> ['Roosevelt', '&', 'Churchill'].
>
> Now we have exactly the situation that gives rise to the dismax stopwords
> mm-behaving-funny problem; it's exactly the same thing.
>
> Now I've fixed this for punctuation just by making those fields strip out
> punctuation, by adding these filters to the bottom of the analyzer chain in
> those previously-not-stripping-punctuation field definitions:
>
> <!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
> <filter class="solr.PatternReplaceFilterFactory"
>         pattern="([\p{Punct}])" replacement="" replace="all"/>
> <!-- if after stripping punctuation we have any 0-length tokens, make
>      sure to eliminate them. We can use LengthFilter min=1 for that;
>      we don't care about the max here, just a very large number. -->
> <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>
>
> And things are working how I expect again, at least for this punctuation
> issue. But there may be other edge cases where differences in analysis
> result in a different number of tokens from different fields, which, if they
> are both included in a dismax qf, will have bad effects on 'mm'.
>
> The lesson, I think, is that the only absolutely safe way to use dismax 'mm'
> is when all fields in the 'qf' have exactly the same analysis.  But
> obviously that's not very practical, it destroys much of the power of
> dismax. And some differences in analysis are certainly acceptable -- but
> it's rather tricky to figure out if your differences in analysis are going
> to be significant for this problem, under what input, and if so fix them. It
> is not an easy thing to do.  So dismax definitely has this gotcha
> potentially waiting for you, whenever mixing fields with different analysis
> in a 'qf'.
>
>
> On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
>>
>> Okay, let's try the debug trace again without a pf to be less confusing.
>>
>> One field in qf, one that's ordinarily tokenized text, and it does get hits:
>>
>>
>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
>>
>> <str name="rawquerystring">churchill : roosevelt</str>
>> <str name="querystring">churchill : roosevelt</str>
>> <str name="parsedquery">
>> +((DisjunctionMaxQuery((title1_t:churchil)~0.01)
>> DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
>> </str>
>> <str name="parsedquery_toString">
>> +(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
>> </str>
>>
>> And that gets 25 hits. Now we add in a second field to the qf, this second
>> field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
>> adding another field into qf, right? And indeed it still results in exactly
>> 25 hits (no additional hits from the additional qf field).
>>
>>
>> ?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
>>
>> <str name="parsedquery">
>> +((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
>> DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
>> </str>
>> <str name="parsedquery_toString">
>> +(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
>> title1_t:roosevelt)~0.01)~2) ()
>> </str>
>>
>>
>>
>> Okay, now we go back to just that first (ordinarily tokenized) field, but
>> add a second field that uses KeywordTokenizerFactory.  We don't
>> necessarily expect this one ever to match for a multi-word query, but we
>> don't expect fewer than 25 hits; the 25 hits from the first field in the qf
>> should still be there, right? But they're not. What happened, why not?
>>
>>
>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
>>
>>
>> <str name="rawquerystring">churchill : roosevelt</str>
>> <str name="querystring">churchill : roosevelt</str>
>> <str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
>> title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
>> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
>> ()</str>
>> <str name="parsedquery_toString">+(((isbn_t:churchill |
>> title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
>> title1_t:roosevelt)~0.01)~3) ()</str>
>>
>>
>>
>> On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
>>>
>>> I'm aware that using a field tokenized with KeywordTokenizerFactory in
>>> a dismax 'qf' is often going to result in 0 hits on that field -- (when a
>>> whitespace-containing query is entered).  But I do it anyway, for cases
>>> where a non-whitespace-containing query is entered, then it hits.  And in
>>> those cases where it doesn't hit, I figure okay, well, the other fields in
>>> qf will hit or not, that's good enough.
>>>
>>> And usually that works. But it works _differently_ when my query contains
>>> an ampersand (or any other punctuation), resulting in 0 hits when it
>>> shouldn't, and I can't figure out why.
>>>
>>> basically,
>>>
>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>
>>> gets hits.  The ":" is thrown out of the text_field by its analysis, but
>>> the mm still passes somehow, right?
>>>
>>> But, in the same index:
>>>
>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>> keyword_tokenized_text_field
>>>
>>> gets 0 hits.  Somehow maybe the inclusion of the
>>> keyword_tokenized_text_field in the qf causes dismax to calculate the mm
>>> differently, decide there are three tokens in there and they all must match,
>>> and the token ":" can never match because it's not in my index it's stripped
>>> out... but somehow this isn't a problem unless I include a keyword-tokenized
>>>  field in the qf?
>>>
>>> This is really confusing. If anyone has any idea what I'm talking about
>>> and can shed any light on it, it would be much appreciated.
>>>
>>> The conclusion I am reaching is just NEVER include anything but a more or
>>> less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
>>> certain use cases for me.
>>>
>>> Oh, hey, the debugging trace would probably be useful:
>>>
>>>
>>> <lst name="debug">
>>> <str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
>>> title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill
>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>> roosevelt"~3^80.0)~0.01)
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
>>> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill
>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>> roosevelt"~3^80.0)~0.01
>>> </str>
>>> <lst name="explain"/>
>>> <str name="QParser">DisMaxQParser</str>
>>> <null name="altquerystring"/>
>>> <null name="boostfuncs"/>
>>> <lst name="timing">
>>> <double name="time">6.0</double>
>>> <lst name="prepare">
>>> <double name="time">3.0</double>
>>> <lst name="org.apache.solr.handler.component.QueryComponent">
>>> <double name="time">2.0</double>
>>> </lst>
>>> <lst name="org.apache.solr.handler.component.FacetComponent">
>>> <double name="time">0.0</double>
>>> </lst>
>>> <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>> <double name="time">0.0</double>
>>> </lst>
>>> <lst name="org.apache.solr.handler.component.HighlightComponent">
>>> <double name="time">0.0</double>
>>> </lst>
>>> <lst name="org.apache.solr.handler.component.StatsComponent">
>>> <double name="time">0.0</double>
>>> </lst>
>>> <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>>> <double name="time">0.0</double>
>>> </lst>
>>> <lst name="org.apache.solr.handler.component.DebugComponent">
>>> <double name="time">0.0</double>
>>> </lst>
>>> </lst>
>>>
>>>
>>>
>>
>
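
The behaviour described in this thread can be sketched as a toy model in Python. This is not Solr's actual code. It assumes that dismax splits the query on whitespace, runs each chunk through every qf field's analyzer, and builds one required clause per chunk for which at least one field yields a token, which is consistent with the parsedquery output quoted above. The analyzer functions, the sample document, and all names below are hypothetical stand-ins, and stemming is left out.

```python
import re
import string

# Character class covering ASCII punctuation (a rough stand-in for \p{Punct}).
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def text_analyzer(chunk):
    """Hypothetical text field: lowercase, strip punctuation, drop empty tokens."""
    token = PUNCT.sub("", chunk.lower())
    return [token] if token else []


def keyword_analyzer(chunk):
    """Hypothetical KeywordTokenizer field: the chunk is kept verbatim."""
    return [chunk]


def keyword_stripping_analyzer(chunk):
    """Keyword field plus the fix from the thread: strip punctuation,
    then drop 0-length tokens (the LengthFilter min=1 step)."""
    token = PUNCT.sub("", chunk)
    return [token] if len(token) >= 1 else []


def build_clauses(query, fields):
    """One clause per whitespace-separated chunk, but only if at least one
    field's analyzer produced a token for that chunk (assumed behaviour)."""
    clauses = []
    for chunk in query.split():
        per_field = {name: analyze(chunk) for name, analyze in fields.items()}
        per_field = {name: toks for name, toks in per_field.items() if toks}
        if per_field:
            clauses.append(per_field)
    return clauses


def matches_mm_100(doc, clauses):
    """mm=100%: every clause needs some field whose token is in the document."""
    return all(
        any(tok in doc.get(field, set())
            for field, toks in clause.items() for tok in toks)
        for clause in clauses
    )


# Hypothetical indexed document: terms per field, after index-time analysis.
DOC = {"title1_t": {"churchill", "roosevelt"}, "isbn_t": {"0123456789"}}
QUERY = "churchill : roosevelt"

if __name__ == "__main__":
    setups = {
        "text only": {"title1_t": text_analyzer},
        "text + keyword": {"title1_t": text_analyzer, "isbn_t": keyword_analyzer},
        "text + keyword (fixed)": {"title1_t": text_analyzer,
                                   "isbn_t": keyword_stripping_analyzer},
    }
    for label, fields in setups.items():
        clauses = build_clauses(QUERY, fields)
        print(label, len(clauses), matches_mm_100(DOC, clauses))
```

In this model the text-only setup yields 2 clauses and the document matches. Adding the verbatim keyword field yields a third clause for ":" that no field can satisfy, so mm=100% fails. Stripping punctuation from the keyword field as well brings the count back to 2 and the match returns.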
