Hey Solr folks,

Current dismax parser behavior is different for unigrams versus bigrams.

For unigrams, the score is MAX-ed across fields (the so-called dismax), but for bigrams it has been SUM-ed since Solr 4.10 (according to https://issues.apache.org/jira/browse/SOLR-6062).

Given this inconsistency, the dilemma we are facing now is the following:
For a query with three terms: [A B C]
Relevant doc1: f1:[AB .. C] f2:[BC] // here AB in field1 and BC in field2 are bigrams, and C is a unigram
Irrelevant doc2: f1:[AB .. C] f2:[AB] f3:[AB] // here only bigram AB is present in the doc, but in three different fields

(A B C here can be e.g. "light blue bag", and doc2 can talk about "light blue coat" a lot, while only mentioning a "bag" somewhere.)

Without bigram level MAX across fields, there is no way to rank doc1 above doc2. (doc1 is preferred because it hits two different bigrams, while doc2 only hits one bigram in several different fields.)
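
To make the ranking problem concrete, here is a toy sketch in Python. It is not Solr's actual scoring: it assumes every bigram match in a field scores exactly 1.0 (real scores depend on the similarity, field boosts, and the tie parameter), and the names `bigram_score`, `doc1`, and `doc2` are made up for illustration. The unigram part is left out because it is the same for both docs in this example.

```python
# Toy model: each doc maps a field name to the set of query bigrams found in it.
# Assumption: every field-level bigram match scores 1.0 (not real Lucene scoring).
QUERY_BIGRAMS = ["AB", "BC"]

doc1 = {"f1": {"AB"}, "f2": {"BC"}}                # two different bigrams
doc2 = {"f1": {"AB"}, "f2": {"AB"}, "f3": {"AB"}}  # one bigram, three fields


def bigram_score(doc, combine):
    """Combine per-field scores for each bigram, then add up over bigrams."""
    total = 0.0
    for bigram in QUERY_BIGRAMS:
        per_field = [1.0 if bigram in grams else 0.0 for grams in doc.values()]
        total += combine(per_field)
    return total


for name, combine in (("SUM", sum), ("MAX", max)):
    print(name, "doc1 =", bigram_score(doc1, combine),
          "doc2 =", bigram_score(doc2, combine))
```

Under these assumptions, SUM across fields gives doc1 = 2.0 and doc2 = 3.0, so doc2 ranks first, while MAX across fields gives doc1 = 2.0 and doc2 = 1.0, which is the ordering we want.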

Also, being a sum makes the retrieval score difficult to bound, making it hard to combine the retrieval score with other document level signals (e.g. document quality), or to trade off between unigrams and bigrams.

Are the problems clear?

Can someone offer a solution other than dismax for bigrams/phrases, e.g. https://issues.apache.org/jira/browse/SOLR-6600 ? (SOLR-6600 seems to be misclassified as a duplicate of SOLR-6062, even though the two seem to be exact opposites.)

Thanks,
Le

PS cc'ing Jan who pointed me to the group.
