Jonathan: Thanks for writing that up, you're right, it is arcane....
I've starred this one!

Erick

> http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
> http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
>
> So to understand, first familiarize yourself with that.
>
> However, none of the fields involved here had any stopwords at all, so at
> first it wasn't obvious this was the problem. But having different
> tokenization and other analysis between fields can result in exactly the
> same problem, for certain queries.
>
> One field in the dismax qf used an analyzer that stripped punctuation. (I'm
> actually not positive at this point _which_ analyzer in my chain was
> stripping punctuation, I'm using a bunch including some custom ones, but I
> was aware that punctuation was being stripped, this was intentional.)
>
> So "monkey's" turns into "monkey". "monkey:" turns into "monkey". So far
> so good. But what happens if you have punctuation all by itself, separated by
> whitespace? "Roosevelt & Churchill" turns into ['roosevelt', 'churchill'].
> That ampersand in the middle was stripped out, essentially _just as if_ it
> were a stopword. Only two tokens result from that input.
>
> You can see where this is going -- another field involved in the dismax qf
> did NOT strip out punctuation. So three tokens result from that input,
> ['Roosevelt', '&', 'Churchill'].
>
> Now we have exactly the situation that gives rise to the dismax stopwords
> mm-behaving-funny situation, it's exactly the same thing.
>
> Now I've fixed this for punctuation just by making those fields strip out
> punctuation, by adding these analyzers to the bottom of those
> previously-not-stripping-punctuation field definitions:
>
> <!-- strip punctuation, to avoid dismax stopwords-like mm bug -->
> <filter class="solr.PatternReplaceFilterFactory"
>         pattern="([\p{Punct}])" replacement="" replace="all"
> />
> <!-- if after stripping punc we have any 0-length tokens, make
>      sure to eliminate them. We can use LengthFilter min=1 for that,
>      we don't care about the max here, just a very large number. -->
> <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>
> And things are working how I expect again, at least for this punctuation
> issue. But there may be other edge cases where differences in analysis
> result in a different number of tokens from different fields, which, if they
> are both included in a dismax qf, will have bad effects on 'mm'.
>
> The lesson, I think, is that the only absolutely safe way to use dismax 'mm'
> is when all fields in the 'qf' have exactly the same analysis. But
> obviously that's not very practical, it destroys much of the power of
> dismax. And some differences in analysis are certainly acceptable -- but
> it's rather tricky to figure out if your differences in analysis are going
> to be significant for this problem, under what input, and if so to fix them.
> It is not an easy thing to do. So dismax definitely has this gotcha
> potentially waiting for you, whenever mixing fields with different analysis
> in a 'qf'.
>
> On 6/14/2011 5:25 PM, Jonathan Rochkind wrote:
>>
>> Okay, let's try the debug trace again without a pf to be less confusing.
>>
>> One field in qf, that's ordinary text tokenized, and does get hits:
>>
>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=
>>
>> <str name="rawquerystring">churchill : roosevelt</str>
>> <str name="querystring">churchill : roosevelt</str>
>> <str name="parsedquery">
>> +((DisjunctionMaxQuery((title1_t:churchil)~0.01)
>> DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()
>> </str>
>> <str name="parsedquery_toString">
>> +(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()
>> </str>
>>
>> And that gets 25 hits. Now we add in a second field to the qf, this second
>> field is also ordinarily tokenized. We expect no _fewer_ than 25 hits,
>> adding another field into qf, right? And indeed it still results in exactly
>> 25 hits (no additional hits from the additional qf field).
>>
>> ?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=
>>
>> <str name="parsedquery">
>> +((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01)
>> DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()
>> </str>
>> <str name="parsedquery_toString">
>> +(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt |
>> title1_t:roosevelt)~0.01)~2) ()
>> </str>
>>
>> Okay, now we go back to just that first (ordinarily tokenized) field, but
>> add a second field in that uses KeywordTokenizerFactory. We don't expect
>> this necessarily to ever match for a multi-word query, but we don't expect
>> fewer than 25 hits either; the 25 hits from the first field in the qf should
>> still be there, right? But they're not. What happened, why not?
>>
>> q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=
>>
>> <str name="rawquerystring">churchill : roosevelt</str>
>> <str name="querystring">churchill : roosevelt</str>
>> <str name="parsedquery">+((DisjunctionMaxQuery((isbn_t:churchill |
>> title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01)
>> DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3)
>> ()</str>
>> <str name="parsedquery_toString">+(((isbn_t:churchill |
>> title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt |
>> title1_t:roosevelt)~0.01)~3) ()</str>
>>
>> On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
>>>
>>> I'm aware that using a field tokenized with KeywordTokenizerFactory in
>>> a dismax 'qf' is often going to result in 0 hits on that field (when a
>>> whitespace-containing query is entered). But I do it anyway, for cases
>>> where a non-whitespace-containing query is entered, then it hits. And in
>>> those cases where it doesn't hit, I figure okay, well, the other fields in
>>> qf will hit or not, that's good enough.
>>>
>>> And usually that works. But it works _differently_ when my query contains
>>> an ampersand (or any other punctuation), resulting in 0 hits when it
>>> shouldn't, and I can't figure out why.
>>>
>>> Basically,
>>>
>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>>
>>> gets hits. The ":" is thrown out by the text_field, but the mm still passes
>>> somehow, right?
>>>
>>> But, in the same index:
>>>
>>> &defType=dismax&mm=100%&q=one : two&qf=text_field
>>> keyword_tokenized_text_field
>>>
>>> gets 0 hits. Somehow maybe the inclusion of the
>>> keyword_tokenized_text_field in the qf causes dismax to calculate the mm
>>> differently, decide there are three tokens in there and they all must match,
>>> and the token ":" can never match because it's not in my index, it's stripped
>>> out... but somehow this isn't a problem unless I include a keyword-tokenized
>>> field in the qf?
>>>
>>> This is really confusing; if anyone has any idea what I'm talking about
>>> and can shed any light on it, much appreciated.
>>>
>>> The conclusion I am reaching is just NEVER include anything but a more or
>>> less ordinarily tokenized field in a dismax qf. Sadly, it was useful for
>>> certain use cases for me.
>>>
>>> Oh, hey, the debugging trace would probably be useful:
>>>
>>> <lst name="debug">
>>> <str name="rawquerystring">churchill : roosevelt</str>
>>> <str name="querystring">churchill : roosevelt</str>
>>> <str name="parsedquery">
>>> +((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01)
>>> DisjunctionMaxQuery((isbn_t::)~0.01) DisjunctionMaxQuery((isbn_t:roosevelt |
>>> title1_t:roosevelt)~0.01))~3) DisjunctionMaxQuery((title2_unstem:"churchill
>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>> roosevelt"~3^80.0)~0.01)
>>> </str>
>>> <str name="parsedquery_toString">
>>> +(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01
>>> (isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) (title2_unstem:"churchill
>>> roosevelt"~3^240.0 | text:"churchil roosevelt"~3^10.0 | title2_t:"churchil
>>> roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 |
>>> title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil
>>> roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 |
>>> author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill
>>> roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 |
>>> other_number_unstem:"churchill roosevelt"~3^40.0 | subject_unstem:"churchill
>>> roosevelt"~3^80.0 | title_series_t:"churchil roosevelt"~3^40.0 |
>>> title_series_unstem:"churchill roosevelt"~3^60.0 | text_unstem:"churchill
>>> roosevelt"~3^80.0)~0.01
>>> </str>
>>> <lst name="explain"/>
>>> <str name="QParser">DisMaxQParser</str>
>>> <null name="altquerystring"/>
>>> <null name="boostfuncs"/>
>>> <lst name="timing">
>>>   <double name="time">6.0</double>
>>>   <lst name="prepare">
>>>     <double name="time">3.0</double>
>>>     <lst name="org.apache.solr.handler.component.QueryComponent">
>>>       <double name="time">2.0</double>
>>>     </lst>
>>>     <lst name="org.apache.solr.handler.component.FacetComponent">
>>>       <double name="time">0.0</double>
>>>     </lst>
>>>     <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
>>>       <double name="time">0.0</double>
>>>     </lst>
>>>     <lst name="org.apache.solr.handler.component.HighlightComponent">
>>>       <double name="time">0.0</double>
>>>     </lst>
>>>     <lst name="org.apache.solr.handler.component.StatsComponent">
>>>       <double name="time">0.0</double>
>>>     </lst>
>>>     <lst name="org.apache.solr.handler.component.SpellCheckComponent">
>>>       <double name="time">0.0</double>
>>>     </lst>
>>>     <lst name="org.apache.solr.handler.component.DebugComponent">
>>>       <double name="time">0.0</double>
>>>     </lst>
>>>   </lst>
>>>
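
For anyone who wants to poke at the mechanism outside Solr, here is a rough Python sketch of what the debug traces in this thread suggest is happening. It is a toy model, not Solr code: the two analyzer functions, the field names, and the sample document are made up for illustration, and the clause-building rule (one clause per whitespace-separated chunk, kept if any qf field produces a token for it) is inferred from the parsedquery output above rather than taken from the dismax source.

```python
import re

# Hypothetical stand-ins for two Solr field analyzers (not real Solr APIs).
def text_analyzer(chunk):
    """Lowercase and strip punctuation; a bare ':' yields no token at all."""
    token = re.sub(r"[^\w]", "", chunk.lower())
    return [token] if token else []

def keyword_analyzer(chunk):
    """Keep the chunk verbatim, punctuation included."""
    return [chunk]

ANALYZERS = {
    "text_field": text_analyzer,
    "keyword_field": keyword_analyzer,
}

def build_clauses(query, qf):
    """Assumed model: one clause per whitespace-separated chunk.
    A clause survives if ANY field in qf produces a token for that chunk."""
    clauses = []
    for chunk in query.split():
        alternatives = {(f, t) for f in qf for t in ANALYZERS[f](chunk)}
        if alternatives:
            clauses.append(alternatives)
    return clauses

def matches(doc, query, qf):
    """mm=100%: every surviving clause must match in at least one field."""
    clauses = build_clauses(query, qf)
    return all(
        any(term in doc.get(field, set()) for field, term in clause)
        for clause in clauses
    )

# Indexed terms for one made-up document whose source text is "One : Two".
doc = {
    "text_field": {"one", "two"},      # punctuation stripped at index time
    "keyword_field": {"One : Two"},    # whole value kept as a single token
}

print(len(build_clauses("one : two", ["text_field"])))                   # 2 clauses
print(len(build_clauses("one : two", ["text_field", "keyword_field"])))  # 3 clauses
print(matches(doc, "one : two", ["text_field"]))                         # True
print(matches(doc, "one : two", ["text_field", "keyword_field"]))        # False
```

With only text_field in qf, the ":" chunk produces no token anywhere, so its clause disappears and mm=100% means 2 of 2. Add the keyword field and the ":" chunk now yields a clause that only the keyword field can satisfy, so mm=100% means 3 of 3 and the document drops out, which is the same shape as the ~2 versus ~3 in the traces above.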