Hey Solr folks,
The current dismax parser behaves differently for unigrams and bigrams.
For unigrams, scores are MAX-ed across fields (the so-called dismax),
but for bigrams they have been SUM-ed across fields since Solr 4.10
(according to https://issues.apache.org/jira/browse/SOLR-6062).
Given this inconsistency, we are facing the following dilemma.
Consider a query with three terms, [A B C]:
Relevant doc1: f1:[AB .. C] f2:[BC] // here AB in field1 and BC in
field2 are bigrams, and C is a unigram
Irrelevant doc2: f1:[AB .. C] f2:[AB] f3:[AB] // here only bigram AB is
present in the doc, but in three different fields.
(A B C here can be e.g. "light blue bag", and doc2 can talk about "light
blue coat" a lot, while only mentioning a "bag" somewhere.)
Without a bigram-level MAX across fields, there is no way to rank doc1
above doc2.
(doc1 is preferred because it hits two different bigrams, while doc2
only hits one bigram in several different fields.)
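
To make the ranking problem concrete, here is a toy sketch of the bigram
part of the score only. It assumes every bigram match in a field scores
a flat 1.0, which is a simplification for illustration and not Lucene's
real similarity; the names below are hypothetical.

```python
# Toy model of the bigram part of the score for the query [A B C].
# Assumption: each bigram match in a field contributes a flat 1.0.
QUERY_BIGRAMS = ["AB", "BC"]

# Bigrams matched per field, following the doc1/doc2 example above.
DOC1 = {"f1": {"AB"}, "f2": {"BC"}}
DOC2 = {"f1": {"AB"}, "f2": {"AB"}, "f3": {"AB"}}


def field_scores(doc, bigram):
    """Per-field score of one bigram: 1.0 where it matches, else 0.0."""
    return [1.0 if bigram in matched else 0.0 for matched in doc.values()]


def bigram_score(doc, combine):
    """Combine each bigram's field scores with `combine`, then add up bigrams."""
    return sum(combine(field_scores(doc, bg)) for bg in QUERY_BIGRAMS)


if __name__ == "__main__":
    for name, doc in (("doc1", DOC1), ("doc2", DOC2)):
        print(name, "SUM:", bigram_score(doc, sum), "MAX:", bigram_score(doc, max))
```

Under this toy model, SUM gives doc1 2.0 and doc2 3.0, so doc2 wins;
MAX gives doc1 2.0 and doc2 1.0, so doc1 wins.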
Also, because the bigram score is a sum, the retrieval score is
difficult to bound, which makes it hard to combine with other
document-level signals (e.g. document quality), or to trade off between
unigrams and bigrams.
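
A rough sketch of the bounding issue, again assuming a flat
per-match score of 1.0 (a hypothetical simplification, not Lucene's
scoring): for a single bigram, the SUM grows with the number of fields
that contain it, while the MAX stays capped at the per-match score.

```python
def sum_across_fields(num_matching_fields, per_match=1.0):
    """Score of one bigram when field scores are summed."""
    return num_matching_fields * per_match


def max_across_fields(num_matching_fields, per_match=1.0):
    """Score of one bigram when only the best field counts."""
    return per_match if num_matching_fields > 0 else 0.0


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, "fields -> SUM:", sum_across_fields(n), "MAX:", max_across_fields(n))
```

With MAX, the bigram part of the score is bounded by the number of
query bigrams times the per-match cap, regardless of how many fields
are searched.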
Are the problems clear?
Can someone offer a solution other than dismax for bigrams/phrases
(i.e. https://issues.apache.org/jira/browse/SOLR-6600)? SOLR-6600 seems
to be misclassified as a duplicate of SOLR-6062, although the two seem
to be exact opposites.
Thanks,
Le