Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

Chris Hostetter Tue, 21 Jun 2011 16:30:02 -0700

: It seems like the problem is when different fields in the 'qf' produce a
: different number of tokens for a given query.  dismax needs to know the number
: of tokens in the input in order to calculate 'mm', when 'mm' is expressed as a
: percentage, or when different mm's are given for different numbers of input
: tokens.


actually the fundamental problem is that when this situation arises, 
dismax has no way of knowing *if* you want the token that only produced a 
TermQuery in fieldA but not fieldB to be counted at all.

In your case, you don't want the "&" query against your simple (non 
whitespace stripping) field to count in computing minShouldMatch, but how 
does dismax know that?

if someone has a field that not only strips out punctuation, but also 
ignores anything that doesn't match one of their known keywords (using the 
KeepWordsFilter) they would want the exact opposite situation as you -- they 
are really counting on the cases where a token produces a valid query for 
that special field to be a factor, and don't want the number of clauses used 
to compute minShouldMatch to be lowered artificially just because all the 
other tokens in the input don't produce anything for that field.

bottom line: as long as one field produces a token for a chunk of input, 
that's a clause -- it may only be a clause that's queried against one 
field, but it's still a clause.
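
A minimal sketch of that counting rule, in plain Python rather than Solr 
code.  The field names (title_exact, title_text) and both analyzer 
functions are hypothetical stand-ins, and the percentage handling is 
simplified to rounding down:

```python
import re

def keyword_analyzer(chunk):
    # Stand-in for a keyword-style field: the whole chunk is one token.
    return [chunk] if chunk else []

def text_analyzer(chunk):
    # Stand-in for a field that strips punctuation: "&" yields nothing.
    return re.findall(r"\w+", chunk.lower())

ANALYZERS = {"title_exact": keyword_analyzer, "title_text": text_analyzer}

def clauses(query):
    """List (chunk, fields) for every chunk that produced a token somewhere."""
    result = []
    for chunk in query.split():
        fields = [name for name, analyze in ANALYZERS.items() if analyze(chunk)]
        if fields:  # one field producing a token is enough to make a clause
            result.append((chunk, fields))
    return result

def min_should_match(num_clauses, percent):
    # Simplified percentage mm, rounded down.
    return num_clauses * percent // 100

found = clauses("foo & bar")
for chunk, fields in found:
    print(chunk, fields)
print("mm at 100%:", min_should_match(len(found), 100))
```

The "&" clause is only queried against title_exact, but it still raises 
the clause count to 3, so mm=100% requires it to match.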

: So what if dismax could recognize that different fields were producing
: different arity of input, and use the _smallest_ number for its 'mm'
: calculations, instead of current behavior where it's effectively the largest
: number? (Or '1' if the smallest number is '0'?!) That would in some cases
: produce errors in the other direction -- more hits coming back than you
: naively/intuitively expect.   Not sure if that would be worse or better. Seems
: better to me, less bad failure mode.

consider my previous example, and something similar to Jira searching 
where you might have a "projectCode" field with a query time 
KeepWordsFilter that only matches project codes ... right now, a query 
like q=SOLR+foo+bar+baz&mm=100%&qf=projectCode^100+text would give you 
some really nice results that match all the input, and if SOLR is a 
projectCode those issues bubble to the top -- with your proposal, the 
effective mm would be "1" (because the projectCode field would only wind 
up with the SOLR clause) and you'd get all sorts of crap -- because those 
other clauses are all still there.  so you'd get *all* projectCode:SOLR 
issues, and *all* issues matching text:foo, and *all* issues matching 
text:bar etc...
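
To make the difference concrete, here is a rough simulation of that query 
under both rules.  It is only a sketch: the analyzers, the keep-words 
list and the two sample documents are made up for illustration, and 
matching is reduced to simple token lookup:

```python
KNOWN_PROJECT_CODES = {"SOLR", "LUCENE"}  # hypothetical keep-words list

def project_code_analyzer(chunk):
    # Keeps a chunk only if it is a known project code.
    return [chunk] if chunk in KNOWN_PROJECT_CODES else []

def text_analyzer(chunk):
    return [chunk.lower()]

FIELDS = {"projectCode": project_code_analyzer, "text": text_analyzer}

def clauses_matched(doc, query):
    """Count the chunks that match the doc in at least one field."""
    return sum(
        1 for chunk in query.split()
        if any(set(analyze(chunk)) & doc[field] for field, analyze in FIELDS.items())
    )

query = "SOLR foo bar baz"
chunks = query.split()
per_field = {f: sum(1 for c in chunks if a(c)) for f, a in FIELDS.items()}

# Current rule: every chunk that yields a token anywhere is a clause (4 here).
current_mm = sum(1 for c in chunks if any(a(c) for a in FIELDS.values()))
# Proposed rule: smallest per-field token count, but at least 1 (1 here).
proposed_mm = max(1, min(per_field.values()))

docs = {
    "good": {"projectCode": {"SOLR"}, "text": {"foo", "bar", "baz"}},
    "only_foo": {"projectCode": {"LUCENE"}, "text": {"foo"}},
}
for name, doc in docs.items():
    n = clauses_matched(doc, query)
    print(name, n >= current_mm, n >= proposed_mm)
# good True True
# only_foo False True
```

The "only_foo" document is the kind of noise the proposal would let in: 
it matches a single clause, which is all an effective mm of 1 asks for.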

: Or better yet, but surely harder perhaps infeasible to code, it would somehow
: apply the 'mm' differently to each field. Not even sure what that means

That's pretty much impossible.  the whole nature of the dismax style 
parser is that a DisjunctionMaxQuery is computed for each "word" of the 
q, across all "fields" in the qf -- it's those DisjunctionMaxQueries that 
are wrapped in a BooleanQuery with minShouldMatch set on it...

        http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

...if you "flipped" that matrix along the diagonal to have a different mm 
per field, you'd lose the value of the field specific boosts.
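
The shape of that query can be sketched as a string builder.  This is 
not the parser's actual code or its exact debug output, just an 
illustration with hypothetical analyzers, where "|" separates the 
per-field alternatives inside one DisjunctionMaxQuery and "~N" is the 
minShouldMatch on the outer BooleanQuery:

```python
KNOWN_PROJECT_CODES = {"SOLR"}  # hypothetical keep-words list

ANALYZERS = {
    "projectCode": lambda chunk: [chunk] if chunk in KNOWN_PROJECT_CODES else [],
    "text": lambda chunk: [chunk.lower()],
}

def build_dismax(query, qf, mm_percent):
    """One disjunction per word across all qf fields, wrapped with one mm."""
    clauses = []
    for chunk in query.split():
        alternatives = [
            f"{field}:{token}^{boost}"
            for field, boost in qf.items()
            for token in ANALYZERS[field](chunk)
        ]
        if alternatives:  # a word with no tokens in any field adds no clause
            clauses.append("(" + " | ".join(alternatives) + ")")
    mm = len(clauses) * mm_percent // 100  # simplified: rounded down
    return "(" + " ".join(clauses) + f")~{mm}"

parsed = build_dismax("SOLR foo", {"projectCode": 100, "text": 1}, 100)
print(parsed)
# ((projectCode:SOLR^100 | text:solr^1) (text:foo^1))~2
```

The boosts live inside each per-word disjunction, and the single mm sits 
outside all of them, which is why there is no natural place to hang a 
per-field mm.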


Ultimately the problem you had with "&" is the same problem people have 
with stopwords, and comes down to the same thing: if you don't want some 
chunk of text to be "significant" when searching a field in your qf, have 
your analyzer remove it -- if the analyzer for a field in the qf produces 
a token, dismax assumes it's significant to the query and factors it into 
the mm, matching, and scoring.
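
As a rough before/after illustration of that advice (the analyzer 
functions are hypothetical stand-ins, not real Solr filters):

```python
import re

def exact_analyzer(chunk):
    # Keyword-style field: the whole chunk is kept as one token.
    return [chunk]

def exact_analyzer_no_punct(chunk):
    # Same field, but chunks with no letters or digits are dropped.
    return [chunk] if re.search(r"\w", chunk) else []

def text_analyzer(chunk):
    return re.findall(r"\w+", chunk.lower())

def clause_count(query, analyzers):
    # A chunk is a clause if any field's analyzer produces a token for it.
    return sum(1 for c in query.split() if any(a(c) for a in analyzers))

before = clause_count("foo & bar", [exact_analyzer, text_analyzer])
after = clause_count("foo & bar", [exact_analyzer_no_punct, text_analyzer])
print(before, after)  # 3 2
```

Once no field produces a token for "&", it stops being a clause and no 
longer counts toward mm.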


-Hoss
