Re: [lucy-user] C library - Scoring mechanism

Marvin Humphrey Thu, 23 Nov 2017 11:39:38 -0800

On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <[email protected]> wrote:
> On 21/11/2017 18:42, [email protected] wrote:


>> 2- (same question but for multiple indexes and polysearcher) If I use
>> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
>> Or would they be calculated separately for each index?
>
> I don't know off top of my head. It's possible that indexes are searched
> separately and the results are simply merged by normalized score. I'd have
> to look at the code to answer the question, but maybe Marvin can chime in.

The scores will be consistent.

To calculate IDF for a term accurately across a composite corpus
formed from multiple indexes, you need to know two things:

1. The total number of documents in the corpus. (Doc_Max())
2. The total number of documents which contain the term. (Doc_Freq(field, term))

Both PolySearcher and ClusterSearcher calculate their doc_max on
construction by summing the doc_max totals of all subsearchers.
Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
responses for all subsearchers.

https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348

This approach trades away some performance for the sake of accuracy,
particularly with Doc_Freq -- query normalization takes longer when
you have to wait for a lot of subsearchers to report Doc_Freq numbers
for N terms. However, the alternative is occasional bizarre search
results.

The best anecdote I ever heard illustrating why it's important to
calculate aggregate IDF consistently was an application searching a
multi-shard index containing news articles split by year.  If you
searched for "iphone", it would be a very common term after the first
release of the Apple iPhone. However, in the years prior to the Apple
iPhone's release, if "iphone" existed in a shard it was likely a typo,
so it would be very rare **and thus heavily weighted**. So the top hit
for "iphone", without consistent IDF calculation, would be a typo'd
article.

(A performance improvement on this stratagem is to create a shared
Doc_Freq source. So long as it contains all the common terms across
all shards, it doesn't have to be updated often -- Doc_Freq values
don't change very fast as indexes are updated.)

Marvin Humphrey

Re: [lucy-user] C library - Scoring mechanism

Reply via email to