dismax for bigrams and phrases

Le Zhao Thu, 11 Feb 2016 09:58:53 -0800

Hey Solr folks,

Current dismax parser behavior is different for unigrams versus bigrams.

For unigrams, scores are MAX-ed across fields (the so-called dismax), but for bigrams they are SUM-ed as of Solr 4.10 (according to https://issues.apache.org/jira/browse/SOLR-6062).


Given this inconsistency, the dilemma we are facing now is the following:
for a query with three terms: [A B C]

Relevant doc1: f1:[AB .. C] f2:[BC] // here AB in field1 and BC in field2 are bigrams, and C is a unigram
Irrelevant doc2: f1:[AB .. C] f2:[AB] f3:[AB] // here only bigram AB is present in the doc, but in three different fields.

(A B C here can be e.g. "light blue bag", and doc2 can talk about a "light blue coat" a lot, while only mentioning a "bag" somewhere.)

Without bigram-level MAX across fields, there is no way to rank doc1 above doc2. (doc1 is preferred because it hits two different bigrams, while doc2 only hits one bigram in several different fields.)
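
To make the doc1/doc2 comparison concrete, here is a minimal Python sketch of the two ways of combining bigram scores. It is a toy model, not Solr's actual scoring: every bigram match in a field is assumed to score exactly 1.0, and the names (DOC1, DOC2, field_scores, sum_across_fields, max_across_fields) are made up for illustration.

```python
# Toy model: each document maps a field name to the set of query bigrams
# that field contains. Unigram matches are left out to isolate the bigram part.
DOC1 = {"f1": {"AB"}, "f2": {"BC"}}                # relevant: two distinct bigrams
DOC2 = {"f1": {"AB"}, "f2": {"AB"}, "f3": {"AB"}}  # irrelevant: one bigram, three fields
QUERY_BIGRAMS = ["AB", "BC"]                       # bigrams of the query [A B C]


def field_scores(doc, bigram):
    """Per-field scores for one bigram; 1.0 per match is an assumed constant."""
    return [1.0 if bigram in grams else 0.0 for grams in doc.values()]


def sum_across_fields(doc):
    """SUM behavior: every field that matches a bigram adds to the score."""
    return sum(sum(field_scores(doc, b)) for b in QUERY_BIGRAMS)


def max_across_fields(doc):
    """MAX behavior: each bigram contributes only its best-scoring field."""
    return sum(max(field_scores(doc, b)) for b in QUERY_BIGRAMS)


if __name__ == "__main__":
    for name, doc in (("doc1", DOC1), ("doc2", DOC2)):
        print(name, "sum:", sum_across_fields(doc), "max:", max_across_fields(doc))
```

Under these assumptions the SUM variant gives doc1 a score of 2.0 and doc2 a score of 3.0, so doc2 ranks first. The MAX variant gives doc1 2.0 and doc2 1.0, so doc1 ranks first. In this toy model the MAX variant can never exceed the number of query bigrams, whereas the SUM variant grows with the number of matching fields.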

Also, being a sum makes the retrieval score difficult to bound, making it hard to combine the retrieval score with other document-level signals (e.g. document quality), or to trade off between unigrams and bigrams.


Are the problems clear?

Can someone offer a solution other than dismax for bigrams/phrases, i.e. https://issues.apache.org/jira/browse/SOLR-6600? (SOLR-6600 seems to be misclassified as a duplicate of SOLR-6062, while the two seem to be exact opposites.)


Thanks,
Le

PS cc'ing Jan who pointed me to the group.
