Comparing scores within the result set for a single query works fine. Mapping 
those to [0,1] is fine, too.
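
For example, a minimal sketch (hypothetical helper, not a Lucene API) of 
mapping one query's result scores to [0,1]:

// Rescale the scores of a single result set to [0,1] by dividing by the
// top score. Only meaningful for comparing hits of this one query.
static float[] normalizeWithinQuery(float[] scores) {
    float max = 0f;
    for (float s : scores) {
        max = Math.max(max, s);
    }
    float[] out = new float[scores.length];
    for (int i = 0; i < scores.length; i++) {
        out[i] = (max > 0) ? scores[i] / max : 0f;
    }
    return out;
}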

Comparing scores for different queries, or even for the same query at different 
times, isn’t valid. Showing the scores to people almost guarantees they’ll 
compare the scores between different queries.

BM25 in Lucene changes the formulas for idf, tf, and length normalization. It 
is still fundamentally a tf.idf model.

https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
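
For reference, a sketch of the per-term BM25 score as described there (classic 
form, with Lucene's defaults k1 = 1.2 and b = 0.75; since Lucene 8 the constant 
(k1 + 1) factor is dropped, which does not change ranking):

// Classic BM25 per-term score: a saturating tf, an idf smoothed to stay
// non-negative, and length normalization folded into the denominator.
static double bm25TermScore(double freq, double docLen, double avgDocLen,
                            long docFreq, long docCount) {
    double k1 = 1.2, b = 0.75;
    double idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    double tfNorm = (freq * (k1 + 1))
            / (freq + k1 * (1 - b + b * docLen / avgDocLen));
    return idf * tfNorm;
}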

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 19, 2022, at 10:03 PM, J. Delgado <joaquin.delg...@gmail.com> wrote:
> 
> Actually, I believe that the Lucene scoring function is based on Okapi BM25 
> (BM is an abbreviation of best matching), which is based on the probabilistic 
> retrieval framework 
> <https://en.m.wikipedia.org/wiki/Probabilistic_relevance_model> developed in 
> the 1970s and 1980s by Stephen E. Robertson 
> <https://en.m.wikipedia.org/wiki/Stephen_E._Robertson>, Karen Spärck Jones 
> <https://en.m.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones>, and others.
> 
> There are several interpretations for IDF and slight variations on its 
> formula. In the original BM25 derivation, the IDF component is derived from 
> the Binary Independence Model 
> <https://en.m.wikipedia.org/wiki/Binary_Independence_Model>.
> 
> Info from: https://en.m.wikipedia.org/wiki/Okapi_BM25
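> 
> For reference, the RSJ/BIM idf weight (without relevance information) is 
> usually written as
> 
> \[ \mathrm{idf}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \]
> 
> where N is the total number of documents and n(q_i) the number containing 
> q_i. Lucene's BM25Similarity wraps the same ratio in log(1 + ...) so the 
> weight stays non-negative.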
> 
>> You could calculate an ideal score, but that can change every time a 
>> document is added to or deleted from the index, because of idf. So the ideal 
>> score isn’t a useful mental model. 
>> 
>> Essentially, you need to tell your users to worry about something that 
>> matters. The absolute value of the score does not matter.
>> 
> While I understand the concern, quite often BM25 scores are used 
> post-retrieval (in two-stage retrieval/ranking systems) to feed 
> learning-to-rank models, which often transform the score into [0,1] using a 
> normalization function that estimates a max score from the score 
> distribution.
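> 
> A minimal sketch of that kind of normalization (hypothetical helper, one of 
> many possible variants), assuming the max is estimated as mean plus three 
> standard deviations of the observed scores:
> 
> // Hypothetical LTR feature normalization: estimate a "max" score from the
> // observed distribution and squash the raw BM25 score into [0,1].
> static double normalizeForLtr(double score, double[] observed) {
>     double mean = 0;
>     for (double s : observed) mean += s;
>     mean /= observed.length;
>     double var = 0;
>     for (double s : observed) var += (s - mean) * (s - mean);
>     double estimatedMax = mean + 3 * Math.sqrt(var / observed.length);
>     // Clamp so outliers above the estimate still map to 1.0.
>     return estimatedMax > 0 ? Math.min(1.0, score / estimatedMax) : 0.0;
> }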
> 
> J
> 
> On Mon, Dec 19, 2022 at 11:31 AM Walter Underwood <wun...@wunderwood.org> wrote:
> That article is copied from the old wiki, so it is much earlier than 2019, 
> more like 2009. Unfortunately, the links to the email discussion are all 
> dead, but the issues in the article are still true.
> 
> If you really want to go down that path, you might be able to do it with a 
> similarity class that implements a probabilistic relevance model. I’d start 
> the literature search with this Google query.
> 
> probabilistic information retrieval 
> <https://www.google.com/search?client=safari&rls=en&q=probablistic+information+retrieval&ie=UTF-8&oe=UTF-8>
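> 
> A minimal sketch of that direction (illustrative only, not tuned; the class 
> name is made up), using Lucene's SimilarityBase extension point. Note that 
> Lucene already ships probabilistic-flavored similarities such as 
> DFRSimilarity, IBSimilarity, and LMDirichletSimilarity in 
> org.apache.lucene.search.similarities.
> 
> import org.apache.lucene.search.similarities.BasicStats;
> import org.apache.lucene.search.similarities.SimilarityBase;
> 
> // Illustrative probabilistic-style similarity: scores with an RSJ-style idf
> // only, ignoring tf. A real model would add document/term weighting here.
> public class SimpleProbabilisticSimilarity extends SimilarityBase {
>     @Override
>     protected double score(BasicStats stats, double freq, double docLen) {
>         double n = stats.getDocFreq();
>         double N = stats.getNumberOfDocuments();
>         double idf = Math.log((N - n + 0.5) / (n + 0.5));
>         return stats.getBoost() * Math.max(0, idf);
>     }
> 
>     @Override
>     public String toString() {
>         return "SimpleProbabilisticSimilarity";
>     }
> }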
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Dec 18, 2022, at 2:47 AM, Mikhail Khludnev <m...@apache.org> wrote:
>> 
>> Thanks for the reply, Walter.
>> Recently Robert commented on a PR with the link 
>> https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages which 
>> gives arguments against my proposal. Honestly, I'm still in doubt.
>> 
>> On Tue, Dec 6, 2022 at 8:15 PM Walter Underwood <wun...@wunderwood.org> wrote:
>> As you point out, this is a probabilistic relevance model. Lucene uses a 
>> vector space model.
>> 
>> A probabilistic model gives an estimate of how relevant each document is to 
>> the query. Unfortunately, its overall relevance isn't as good as a vector 
>> space model's.
>> 
> 
>> You could calculate an ideal score, but that can change every time a 
>> document is added to or deleted from the index, because of idf. So the ideal 
>> score isn’t a useful mental model. 
>> 
>> Essentially, you need to tell your users to worry about something that 
>> matters. The absolute value of the score does not matter.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev <m...@apache.org> wrote:
>>> 
>>> Hello dev! 
>>> Users are interested in the meaning of the absolute value of the score, but 
>>> we always reply that it's just a relative value. The maximum score of the 
>>> matched docs is not an answer. 
>>> Ultimately we need to measure how much sense a query makes in the index. 
>>> E.g., the query [jet OR propulsion OR spider] should be measured as 
>>> nonsense, because the best-matching docs have much lower scores than a 
>>> hypothetical (and presumably absent) doc matching [jet AND propulsion AND 
>>> spider].
>>> Could there be a method that returns the maximum possible score if all 
>>> query terms matched? Something like stubbing postings with a virtual 
>>> all_matching doc that has average stats (tf and field length) and kicking 
>>> the scorers in? It reminds me of something in probabilistic retrieval, but 
>>> only vaguely. Is there anything like this already?
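>>> 
>>> Not the same thing, but possibly a useful building block: since Lucene 8, 
>>> Scorer.getMaxScore(int) exposes an impact-based upper bound on the score. 
>>> A rough sketch (the method name maxScoreUpperBound is made up; this bounds 
>>> what any actually-matching doc could score, not a virtual all-terms doc):
>>> 
>>> import java.io.IOException;
>>> import org.apache.lucene.index.LeafReaderContext;
>>> import org.apache.lucene.search.DocIdSetIterator;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.search.Query;
>>> import org.apache.lucene.search.ScoreMode;
>>> import org.apache.lucene.search.Scorer;
>>> import org.apache.lucene.search.Weight;
>>> 
>>> static float maxScoreUpperBound(IndexSearcher searcher, Query query)
>>>         throws IOException {
>>>     Query rewritten = searcher.rewrite(query);
>>>     // TOP_SCORES lets scorers use impacts to compute score upper bounds.
>>>     Weight weight = searcher.createWeight(rewritten, ScoreMode.TOP_SCORES, 1f);
>>>     float max = 0f;
>>>     for (LeafReaderContext ctx : searcher.getIndexReader().leaves()) {
>>>         Scorer scorer = weight.scorer(ctx);
>>>         if (scorer != null) {
>>>             max = Math.max(max, scorer.getMaxScore(DocIdSetIterator.NO_MORE_DOCS));
>>>         }
>>>     }
>>>     return max;
>>> }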
>>> 
>>> -- 
>>> Sincerely yours
>>> Mikhail Khludnev
>> 
>> 
>> 
>> -- 
>> Sincerely yours
>> Mikhail Khludnev
> 
