Christoph Goller created LUCENE-8943:
----------------------------------------

             Summary: Incorrect IDF in MultiPhraseQuery and SpanOrQuery
                 Key: LUCENE-8943
                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/query/scoring
    Affects Versions: 8.0
            Reporter: Christoph Goller


I recently stumbled across a very old bug in the IDF computation for 
MultiPhraseQuery and SpanOrQuery.

BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
combining IDF values from more than one term / TermStatistics.

I mean the method:


Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats[])


It simply adds up the IDFs from all termStats[].
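
As a minimal sketch of that summation (not the Lucene source): the class and 
method names below are hypothetical, the docFreq values are made up, and the 
idf formula is the BM25 form log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), 
which I assume here for illustration.

```java
final class IdfSumSketch {
    // BM25-style idf (assumed formula): rarer terms get a larger value.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // One idf per term, all added up, as described for idfExplain(..., termStats[]).
    static double sumIdf(long[] docFreqs, long docCount) {
        double sum = 0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;          // hypothetical collection size
        long[] docFreqs = {50_000, 20_000}; // hypothetical docFreqs for "new", "york"
        System.out.println("summed idf = " + sumIdf(docFreqs, docCount));
    }
}
```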

This method is used e.g. in PhraseQuery, where it makes sense. If we assume that 
for the phrase "New York" the occurrences of both words are independent, we can 
multiply their probabilities, and since IDFs are logarithmic, we add them up. 
This seems to be a reasonable approximation. However, this method is also used 
to add up the IDFs of all terms in a MultiPhraseQuery, as can be seen in:


Similarity.SimScorer getStats(IndexSearcher searcher)

A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
positions. IDFs of alternative terms for one position should not be added up. 
Instead we should use the minimum value as an approximation, because this 
corresponds to the docFreq of the most frequent term, and we know that this is a 
lower bound for the docFreq of this position (the number of documents that 
contain at least one of the alternatives).
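
To make the difference concrete, here is a small sketch comparing the two ways 
of combining IDFs. It is not Lucene code: the class, the method names and the 
docFreq values are hypothetical, and the idf formula is an assumed BM25-style 
one. Alternatives at one position are reduced to their minimum idf, and the 
per-position values are then added up as for an ordinary phrase.

```java
final class MultiPhraseIdfSketch {
    // BM25-style idf (assumed formula), decreasing in docFreq.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Behaviour criticised above: every term at every position contributes its idf.
    static double sumOverAllTerms(long[][] docFreqsByPosition, long docCount) {
        double sum = 0;
        for (long[] alternatives : docFreqsByPosition) {
            for (long docFreq : alternatives) {
                sum += idf(docFreq, docCount);
            }
        }
        return sum;
    }

    // Proposed approximation: minimum idf per position, summed across positions.
    // Assumes every position has at least one term.
    static double sumOfPerPositionMin(long[][] docFreqsByPosition, long docCount) {
        double sum = 0;
        for (long[] alternatives : docFreqsByPosition) {
            double min = Double.POSITIVE_INFINITY;
            for (long docFreq : alternatives) {
                min = Math.min(min, idf(docFreq, docCount));
            }
            sum += min;
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000; // hypothetical collection size
        // Position 0 has two alternatives, position 1 has a single term.
        long[][] docFreqsByPosition = {{50_000, 300}, {20_000}};
        System.out.println("sum over all terms  = " + sumOverAllTerms(docFreqsByPosition, docCount));
        System.out.println("sum of per-pos. min = " + sumOfPerPositionMin(docFreqsByPosition, docCount));
    }
}
```

With the summing variant, adding a rare alternative at a position raises the 
score weight, although the position as a whole can only become more frequent.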

In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
SpanWeight and adds up the IDFs of all OR-clauses.

If my arguments are not convincing, look at SynonymQuery / SynonymWeight in the 
constructor:

SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float boost)

A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
docFreq of all its alternative terms. I think this is how it should be.
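
A small sketch of that idea, under stated assumptions: the class and method 
names are hypothetical, the docFreq values are made up, and the idf formula is 
an assumed BM25-style one. It treats the alternatives as one pseudo-term whose 
docFreq is the maximum over the alternatives, and shows that the resulting idf 
is the same as the minimum of the individual idfs.

```java
final class SynonymDocFreqSketch {
    // BM25-style idf (assumed formula), decreasing in docFreq.
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // docFreq of the pseudo-term: the largest docFreq among the alternatives.
    static long pseudoDocFreq(long[] docFreqs) {
        long max = 0;
        for (long docFreq : docFreqs) {
            max = Math.max(max, docFreq);
        }
        return max;
    }

    // Smallest idf among the alternatives, computed term by term.
    static double minIdf(long[] docFreqs, long docCount) {
        double min = Double.POSITIVE_INFINITY;
        for (long docFreq : docFreqs) {
            min = Math.min(min, idf(docFreq, docCount));
        }
        return min;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;             // hypothetical collection size
        long[] docFreqs = {300, 50_000, 1_200}; // hypothetical alternatives
        System.out.println("idf(max docFreq) = " + idf(pseudoDocFreq(docFreqs), docCount));
        System.out.println("min idf          = " + minIdf(docFreqs, docCount));
    }
}
```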



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
