Thank you very much, Nick and Marvin. Your replies were really helpful.

On 2017-11-23 11:38, Marvin Humphrey <[email protected]> wrote: 
> On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <[email protected]> wrote:
> > On 21/11/2017 18:42, [email protected] wrote:
> 
> >> 2- (same question but for multiple indexes and polysearcher) If I use
> >> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
> >> Or would they be calculated separately for each index?
> >
> > I don't know off the top of my head. It's possible that indexes are searched
> > separately and the results are simply merged by normalized score. I'd have
> > to look at the code to answer the question, but maybe Marvin can chime in.
> 
> The scores will be consistent.
> 
> To calculate IDF for a term accurately across a composite corpus
> formed from multiple indexes, you need to know two things:
> 
> 1. The total number of documents in the corpus. (Doc_Max())
> 2. The total number of documents which contain the term.
>    (Doc_Freq(field, term))
> 
> Both PolySearcher and ClusterSearcher calculate their doc_max on
> construction by summing the doc_max totals of all subsearchers.
> Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
> responses for all subsearchers.
> 
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
> 
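
The summation described above can be sketched roughly as follows. This is
an illustration in Python, not Lucy's code: ShardStats and CompositeStats
are hypothetical names, and the IDF formula is an assumed Lucene-style
1 + ln(doc_max / (1 + doc_freq)), which may differ in detail from what
Lucy's Similarity class does.

```python
import math


class ShardStats:
    """Hypothetical stand-in for the statistics one subsearcher reports."""

    def __init__(self, doc_max, doc_freqs):
        self.doc_max = doc_max
        self.doc_freqs = doc_freqs  # {(field, term): number of matching docs}

    def doc_freq(self, field, term):
        return self.doc_freqs.get((field, term), 0)


class CompositeStats:
    """Hypothetical composite view over several shards."""

    def __init__(self, shards):
        self.shards = list(shards)
        # doc_max is summed once, at construction time.
        self.doc_max = sum(s.doc_max for s in self.shards)

    def doc_freq(self, field, term):
        # Doc_Freq is summed across every shard on each request.
        return sum(s.doc_freq(field, term) for s in self.shards)

    def idf(self, field, term):
        # Assumed Lucene-style formula, used here only for illustration.
        if self.doc_max == 0:
            return 1.0
        return 1.0 + math.log(self.doc_max / (1.0 + self.doc_freq(field, term)))
```

Because every shard is scored with the same composite doc_max and doc_freq,
a term gets the same IDF no matter which shard a hit comes from.
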
> This approach trades away some performance for the sake of accuracy,
> particularly with Doc_Freq -- query normalization takes longer when
> you have to wait for a lot of subsearchers to report Doc_Freq numbers
> for N terms. However, the alternative is occasional bizarre search
> results.
> 
> The best anecdote I ever heard illustrating why it's important to
> calculate aggregate IDF consistently was an application searching a
> multi-shard index containing news articles split by year.  If you
> searched for "iphone", it would be a very common term after the first
> release of the Apple iPhone. However, in the years prior to the Apple
> iPhone's release, if "iphone" existed in a shard it was likely a typo,
> so it would be very rare **and thus heavily weighted**. So the top hit
> for "iphone", without consistent IDF calculation, would be a typo'd
> article.
> 
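
To put rough numbers on the anecdote, here is a small sketch. The shard
sizes and term counts are invented for illustration, and the IDF formula
is again an assumed Lucene-style one rather than anything taken from
Lucy's source.

```python
import math


def idf(doc_freq, doc_max):
    # Assumed Lucene-style IDF, for illustration only.
    return 1.0 + math.log(doc_max / (1.0 + doc_freq))


# Hypothetical shards keyed by year: (doc_max, doc_freq for "iphone").
shards = {
    2005: (10_000, 1),      # before the release: a single typo'd article
    2008: (10_000, 2_000),  # after the release: a very common term
}

# Per-shard IDF: each shard only knows its own statistics.
per_shard = {year: idf(df, n) for year, (n, df) in shards.items()}

# Aggregate IDF: sum doc_max and doc_freq across shards first.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
aggregate = idf(total_df, total_docs)

for year, value in per_shard.items():
    print(year, round(value, 2))
print("aggregate", round(aggregate, 2))
```

With per-shard statistics the 2005 typo gets a much larger IDF than any
2008 article, so it tends to win the merged ranking. With the aggregate
value, every shard weights "iphone" identically.
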
> (A performance improvement on this stratagem is to create a shared
> Doc_Freq source. So long as it contains all the common terms across
> all shards, it doesn't have to be updated often -- Doc_Freq values
> don't change very fast as indexes are updated.)
> 
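
One way to picture the shared Doc_Freq source is a snapshot of counts for
common terms that is consulted before any shard is asked. The sketch below
is an assumption about how that could look, not an existing Lucy API:
SharedDocFreq is a hypothetical name, and falling back to a live lookup
for terms missing from the snapshot is my own addition.

```python
class SharedDocFreq:
    """Hypothetical shared Doc_Freq source backed by a periodic snapshot."""

    def __init__(self, snapshot, fallback):
        # snapshot: {(field, term): doc_freq} for the common terms,
        # rebuilt only occasionally since the values drift slowly.
        self.snapshot = dict(snapshot)
        # fallback: callable (field, term) -> int that sums live shard counts.
        self.fallback = fallback

    def doc_freq(self, field, term):
        key = (field, term)
        if key in self.snapshot:
            return self.snapshot[key]      # no round trip to the shards
        return self.fallback(field, term)  # rare term: ask the shards
```

Query normalization then waits on the subsearchers only for terms that
the snapshot does not cover.
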
> Marvin Humphrey
> 
