: It seems like the problem is when different fields in the 'qf' produce a
: different number of tokens for a given query. dismax needs to know the number
: of tokens in the input in order to calculate 'mm', when 'mm' is expressed as a
: percentage, or when different mm's are given for different numbers of input
: tokens.

Actually, the fundamental problem is that when this situation arises, dismax
has no way of knowing *if* you want the token that only produced a TermQuery
in fieldA but not fieldB to be counted at all.

In your case, you don't want the "&" query against your simple (non
whitespace stripping) field to count in computing minShouldMatch, but how
does dismax know that? If someone has a field that not only strips out
punctuation, but also ignores anything that doesn't match one of their known
keywords (using the KeepWordsFilter), they would want the exact opposite
situation from you -- they are really counting on the cases where a token
produces a valid query for that special field to be a factor, and don't want
the number of clauses used to compute minShouldMatch to be lowered
artificially just because all the other tokens in the input don't produce
anything for that field.

Bottom line: as long as one field produces a token for a chunk of input,
that's a clause -- it may only be a clause that's queried against one field,
but it's still a clause.

: So what if dismax could recognize that different fields were producing
: different arity of input, and use the _smallest_ number for its 'mm'
: calculations, instead of current behavior where it's effectively the largest
: number? (Or '1' if the smallest number is '0'?!) That would in some cases
: produce errors in the other direction -- more hits coming back than you
: naively/intuitively expect. Not sure if that would be worse or better. Seems
: better to me, less bad failure mode.

Consider my previous example, and something similar to Jira searching where
you might have a "projectCode" field with a query time KeepWordsFilter that
only matches project codes ...
Right now, a query like...

  q=SOLR+foo+bar+baz&mm=100%&qf=projectCode^100+text

...would give you some really nice results that match all the input, but if
SOLR is a projectCode those issues bubble to the top -- with your proposal,
the effective mm would be "1" (because the projectCode field would only wind
up with the SOLR clause) and you'd get all sorts of crap -- because those
other clauses are all still there. So you'd get *all* projectCode:SOLR
issues, and *all* issues matching text:foo, and *all* issues matching
text:bar, etc...

: Or better yet, but surely harder perhaps infeasible to code, it would somehow
: apply the 'mm' differently to each field. Not even sure what that means

That's pretty much impossible. The whole nature of the dismax style parser is
that a DisjunctionMaxQuery is computed for each "word" of the q, across all
"fields" in the qf -- it's those DisjunctionMaxQueries that are wrapped in a
BooleanQuery with minShouldMatch set on it...

http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

...if you "flipped" that matrix along the diagonal to have a different mm per
field, you'd lose the value of the field specific boosts.

Ultimately the problem you had with "&" is the same problem people have with
stopwords, and comes down to the same thing: if you don't want some chunk of
text to be "significant" when searching a field in your qf, have your
analyzer remove it -- if the analyzer for a field in the qf produces a token,
dismax assumes it's significant to the query and factors it into the mm and
matching and scoring.

-Hoss
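
To make the clause counting described above concrete, here is a minimal
Python sketch. It is not Solr code: the analyzers (strip_punct, keep_as_is,
keep_words) and the "exact" field name are hypothetical stand-ins, and the
assumption that a positive percentage mm rounds down is labeled in the code.
It only models "one clause per chunk of input that produces a token in at
least one qf field", and compares that with the "smallest per-field count"
proposal.

```python
from typing import Callable, Dict, List, Set

Analyzer = Callable[[str], List[str]]


def strip_punct(chunk: str) -> List[str]:
    """Hypothetical analyzer: lowercase, drop non-alphanumeric characters."""
    term = "".join(c for c in chunk.lower() if c.isalnum())
    return [term] if term else []


def keep_as_is(chunk: str) -> List[str]:
    """Hypothetical analyzer: keep every chunk, punctuation included."""
    return [chunk]


def keep_words(words: Set[str]) -> Analyzer:
    """Hypothetical analyzer factory: only chunks in `words` survive."""
    def analyzer(chunk: str) -> List[str]:
        return [chunk] if chunk in words else []
    return analyzer


def build_clauses(q: str, qf: Dict[str, Analyzer]) -> List[Dict[str, List[str]]]:
    """One clause per whitespace chunk that yields a term in at least one field.

    Each clause maps field name -> terms, standing in for the per-word
    disjunction across the qf fields.
    """
    clauses = []
    for chunk in q.split():
        per_field = {f: analyze(chunk) for f, analyze in qf.items()}
        per_field = {f: terms for f, terms in per_field.items() if terms}
        if per_field:
            clauses.append(per_field)
    return clauses


def mm_from_percent(num_clauses: int, percent: int) -> int:
    """Assumption: a positive percentage is applied and rounded down."""
    return num_clauses * percent // 100


def per_field_counts(clauses, fields) -> Dict[str, int]:
    """How many clauses each field actually contributes a term to."""
    return {f: sum(1 for c in clauses if f in c) for f in fields}


if __name__ == "__main__":
    # The "&" case: only the non-stripping field produces a token for "&",
    # but it is still a clause, so mm=100% requires it to match.
    qf = {"text": strip_punct, "exact": keep_as_is}
    clauses = build_clauses("foo & bar", qf)
    print(len(clauses), mm_from_percent(len(clauses), 100))

    # The Jira-like case: using the smallest per-field count drops mm to 1
    # even though all four clauses are still in the query.
    qf = {"projectCode": keep_words({"SOLR"}), "text": strip_punct}
    clauses = build_clauses("SOLR foo bar baz", qf)
    counts = per_field_counts(clauses, qf)
    print(counts, mm_from_percent(min(counts.values()), 100))
```

In the second case the query still contains four optional clauses, so an
effective mm of 1 would let a document match on any single one of them.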