Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Jonathan Rochkind Wed, 15 Jun 2011 11:26:52 -0700

Okay, I figured this one out -- I'm participating in a thread with myself here, but for the benefit of posterity, or if anyone's interested, it's kind of interesting.

It's actually a variation of the known issue with dismax, mm, and fields with varying stopwords. It's a pretty tricky problem with dismax, and it's now clear it goes way beyond just stopwords.


http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

So to understand, first familiarize yourself with that.

However, none of the fields involved here had any stopwords at all, so at first it wasn't obvious this was the problem. But having different tokenization and other analysis between fields can result in exactly the same problem, for certain queries.

One field in the dismax qf used an analyzer that stripped punctuation. (I'm actually not positive at this point _which_ analyzer in my chain was stripping punctuation; I'm using a bunch, including some custom ones. But I was aware that punctuation was being stripped, and this was intentional.)

So "monkey's" turns into "monkey". "monkey:" turns into "monkey". So far so good. But what happens if you have punctuation all by itself, separated by whitespace? "Roosevelt & Churchill" turns into ['roosevelt', 'churchill']. That ampersand in the middle was stripped out, essentially _just as if_ it were a stopword. Only two tokens result from that input.

You can see where this is going -- another field involved in the dismax qf did NOT strip out punctuation. So three tokens result from that input, ['Roosevelt', '&', 'Churchill'].

Now we have exactly the situation that gives rise to the dismax stopwords mm-behaving-funny problem; it's exactly the same thing. With mm=100%, dismax now requires all three clauses to match, but the '&' clause can only ever match in the field that keeps punctuation.
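To make the clause counting concrete, here is a minimal Python sketch. It is a simplified model of the behavior visible in the parsedquery traces quoted further down, not Solr's actual implementation. The two analyzer functions and the field names text_field and other_field are hypothetical stand-ins for the real analysis chains.

```python
import string

# Hypothetical stand-ins for the two kinds of field analysis described above.
def strip_punct_analyzer(chunk):
    """Lowercase and remove punctuation; may produce zero tokens."""
    cleaned = chunk.lower().translate(str.maketrans("", "", string.punctuation))
    return [cleaned] if cleaned else []

def keep_punct_analyzer(chunk):
    """Leave the chunk untouched, punctuation included."""
    return [chunk]

def dismax_clauses(query, analyzers):
    """Model: one clause per whitespace-separated chunk, listing the fields
    that produced a token for it. A chunk that produces no token in any
    field contributes no clause at all."""
    clauses = []
    for chunk in query.split():
        fields = {}
        for name, analyze in analyzers.items():
            tokens = analyze(chunk)
            if tokens:
                fields[name] = tokens
        if fields:
            clauses.append(fields)
    return clauses

query = "Roosevelt & Churchill"

# Only the punctuation-stripping field: '&' vanishes, two clauses remain.
one_field = dismax_clauses(query, {"text_field": strip_punct_analyzer})

# Add a field that keeps punctuation: '&' now gets its own clause,
# and that clause can only be satisfied by other_field.
two_fields = dismax_clauses(
    query,
    {"text_field": strip_punct_analyzer, "other_field": keep_punct_analyzer},
)

print(len(one_field), len(two_fields))  # clause counts that mm is applied to
print(two_fields[1])                    # the clause created by '&'
```

Under this model, a document that matches both words in text_field satisfies two of the three clauses in the two-field case, which is not enough for mm=100%.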

Now I've fixed this for punctuation just by making those fields strip out punctuation, by adding these filters to the bottom of those previously-not-stripping-punctuation field definitions:


<!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="([\p{Punct}])" replacement="" replace="all"
/>
<!-- if after stripping punctuation we have any 0-length tokens, make
     sure to eliminate them. We can use LengthFilter min=1 for that;
     we don't care about the max here, just a very large number. -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
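The effect of those two filters on a token stream can be sketched in Python. This is only an emulation for illustration: the function name is made up, and Python's string.punctuation is assumed to cover the same ASCII characters as Java's \p{Punct}.

```python
import string

# Translation table that deletes every ASCII punctuation character
# (assumed equivalent to the \p{Punct} pattern in the filter above).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct_then_drop_empty(tokens, min_len=1, max_len=100):
    """Emulate PatternReplaceFilter (delete punctuation) followed by
    LengthFilter (drop tokens outside [min_len, max_len])."""
    stripped = [token.translate(PUNCT_TABLE) for token in tokens]
    return [token for token in stripped if min_len <= len(token) <= max_len]

print(strip_punct_then_drop_empty(["Roosevelt", "&", "Churchill"]))
# prints ['Roosevelt', 'Churchill']
```

Without the length filter, the '&' would survive as an empty token rather than disappearing, which is why the second filter is there.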

And things are working how I expect again, at least for this punctuation issue. But there may be other edge cases where differences in analysis result in a different number of tokens from different fields, which, if they are both included in a dismax qf, will have bad effects on 'mm'.

The lesson, I think, is that the only absolutely safe way to use dismax 'mm' is when all fields in the 'qf' have exactly the same analysis. But obviously that's not very practical; it destroys much of the power of dismax. And some differences in analysis are certainly acceptable -- but it's rather tricky to figure out whether your differences in analysis are going to be significant for this problem, under what input, and if so, to fix them. It is not an easy thing to do. So dismax definitely has this gotcha potentially waiting for you whenever you mix fields with different analysis in a 'qf'.
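One rough way to hunt for such inputs is to run sample queries through each field's analysis and flag any chunk where some fields produce a token and others produce none. The Python sketch below assumes hypothetical stand-in analyzers; in practice the token lists would have to come from the real analysis chains of the fields (here named title1_t and isbn_t, as in the traces below).

```python
import string

# Hypothetical stand-ins: a text field that drops surrounding punctuation,
# and a keyword-style field that keeps every chunk as-is.
def text_analyzer(chunk):
    cleaned = chunk.lower().strip(string.punctuation)
    return [cleaned] if cleaned else []

def keyword_analyzer(chunk):
    return [chunk]

def risky_chunks(queries, analyzers):
    """Return (query, chunk, fields_with_tokens, fields_without_tokens)
    for every chunk on which the fields disagree about producing a token."""
    findings = []
    for query in queries:
        for chunk in query.split():
            with_tokens = sorted(n for n, a in analyzers.items() if a(chunk))
            without = sorted(n for n in analyzers if n not in with_tokens)
            if with_tokens and without:
                findings.append((query, chunk, with_tokens, without))
    return findings

samples = ["churchill : roosevelt", "churchill roosevelt"]
fields = {"title1_t": text_analyzer, "isbn_t": keyword_analyzer}
for finding in risky_chunks(samples, fields):
    print(finding)
```

A check like this only covers the sample queries you feed it, so it can point at a problem but cannot prove a 'qf' combination is safe.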



On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, that's ordinary text tokenized, and does get hits:
q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title1_t:churchil)~0.01) DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
</str>
And that gets 25 hits. Now we add in a second field to the qf; this second field is also ordinarily tokenized. We expect no _fewer_ than 25 hits, adding another field into qf, right? And indeed it still results in exactly 25 hits (no additional hits from the additional qf field).
?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
<str name="parsedquery">
+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
</str>
<str name="parsedquery_toString">
+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | title1_t:roosevelt)~0.01)~2) ()
</str>
Okay, now we go back to just that first (ordinarily tokenized) field, but add a second field that uses KeywordTokenizerFactory. We don't necessarily expect this one ever to match for a multi-word query, but we don't expect fewer than 25 hits; the 25 hits from the first field in the qf should still be there, right? But they're not. What happened, why not?
q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) ()
</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) ()
</str>

On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:

I'm aware that using a field tokenized with KeywordTokenizerFactory in a dismax 'qf' is often going to result in 0 hits on that field (when a whitespace-containing query is entered). But I do it anyway, for cases where a non-whitespace-containing query is entered; then it hits. And in those cases where it doesn't hit, I figure okay, well, the other fields in qf will hit or not, that's good enough.
And usually that works. But it works _differently_ when my query contains an ampersand (or any other punctuation), resulting in 0 hits when it shouldn't, and I can't figure out why.
Basically,

&defType=dismax&mm=100%&q=one : two&qf=text_field
gets hits. The ":" is thrown out by the text_field, but the mm still passes somehow, right?
But, in the same index:
&defType=dismax&mm=100%&q=one : two&qf=text_field keyword_tokenized_text_field
gets 0 hits. Somehow maybe the inclusion of the keyword_tokenized_text_field in the qf causes dismax to calculate the mm differently, decide there are three tokens in there and they all must match, and the token ":" can never match because it's not in my index, it's stripped out... but somehow this isn't a problem unless I include a keyword-tokenized field in the qf?
This is really confusing. If anyone has any idea what I'm talking about and can shed any light on it, much appreciated.
The conclusion I am reaching is just NEVER include anything but a more or less ordinarily tokenized field in a dismax qf. Sadly, it was useful for certain use cases for me.
Oh, hey, the debugging trace would probably be useful:


<lst name="debug">
<str name="rawquerystring">
churchill : roosevelt
</str>
<str name="querystring">
churchill : roosevelt
</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill roosevelt"~3^80.0)~0.01)
</str>
<str name="parsedquery_toString">
+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill roosevelt"~3^80.0)~0.01
</str>
<lst name="explain"/>
<str name="QParser">
DisMaxQParser
</str>
<null name="altquerystring"/>
<null name="boostfuncs"/>
<lst name="timing">
<double name="time">
6.0
</double>
<lst name="prepare">
<double name="time">
3.0
</double>
<lst name="org.apache.solr.handler.component.QueryComponent">
<double name="time">
2.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.FacetComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.HighlightComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.StatsComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.SpellCheckComponent">
<double name="time">
0.0
</double>
</lst>
<lst name="org.apache.solr.handler.component.DebugComponent">
<double name="time">
0.0
</double>
</lst>
</lst>
