Thank you very much, Nick and Marvin. Your replies were really helpful.

On 2017-11-23 11:38, Marvin Humphrey <[email protected]> wrote:
> On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <[email protected]> wrote:
> > On 21/11/2017 18:42, [email protected] wrote:
> >
> >> 2- (same question but for multiple indexes and polysearcher) If I use
> >> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
> >> Or would they be calculated separately for each index?
> >
> > I don't know off top of my head. It's possible that indexes are searched
> > separately and the results are simply merged by normalized score. I'd have
> > to look at the code to answer the question, but maybe Marvin can chime in.
>
> The scores will be consistent.
>
> To calculate IDF for a term accurately across a composite corpus
> formed from multiple indexes, you need to know two things:
>
> 1. The total number of documents in the corpus. (Doc_Max())
> 2. The total number of documents which contain the term. (Doc_Freq(field,
>    term))
>
> Both PolySearcher and ClusterSearcher calculate their doc_max on
> construction by summing the doc_max totals of all subsearchers.
> Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
> responses for all subsearchers.
>
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
>
> This approach trades away some performance for the sake of accuracy,
> particularly with Doc_Freq -- query normalization takes longer when
> you have to wait for a lot of subsearchers to report Doc_Freq numbers
> for N terms. However, the alternative is occasional bizarre search
> results.
>
> The best anecdote I ever heard illustrating why it's important to
> calculate aggregate IDF consistently was an application searching a
> multi-shard index containing news articles split by year. If you
> searched for "iphone", it would be a very common term after the first
> release of the Apple iPhone. However, in the years prior to the Apple
> iPhone's release, if "iphone" existed in a shard it was likely a typo,
> so it would be very rare **and thus heavily weighted**. So the top hit
> for "iphone", without consistent IDF calculation, would be a typo'd
> article.
>
> (A performance improvement on this stratagem is to create a shared
> Doc_Freq source. So long as it contains all the common terms across
> all shards, it doesn't have to be updated often -- Doc_Freq values
> don't change very fast as indexes are updated.)
>
> Marvin Humphrey
>
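
For anyone finding this thread later, here is how I understand the summing Marvin describes, as a small Python sketch. It is only an illustration and not Lucy code: the Shard class, the function names, and the document counts are all made up, and the IDF formula (1 + ln(doc_max / (1 + doc_freq))) is an assumed textbook-style form that may differ from what Lucy's Similarity actually computes.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical stand-in for one index: a doc count plus per-term doc freqs."""
    doc_max: int
    doc_freqs: dict = field(default_factory=dict)

    def doc_freq(self, term: str) -> int:
        return self.doc_freqs.get(term, 0)


def idf(doc_max: int, doc_freq: int) -> float:
    # Assumed formula, for illustration only.
    return 1.0 + math.log(doc_max / (1 + doc_freq))


def local_idf(shard: Shard, term: str) -> float:
    """IDF computed from one shard's statistics alone (the inconsistent way)."""
    return idf(shard.doc_max, shard.doc_freq(term))


def aggregate_idf(shards: list, term: str) -> float:
    """IDF computed from doc_max and doc_freq summed over all shards."""
    total_docs = sum(s.doc_max for s in shards)
    total_freq = sum(s.doc_freq(term) for s in shards)
    return idf(total_docs, total_freq)


# Invented numbers: "iphone" is a one-off typo in the early shard
# and a common term in the later one.
shards = [
    Shard(doc_max=100_000, doc_freqs={"iphone": 1}),
    Shard(doc_max=100_000, doc_freqs={"iphone": 5_000}),
]

if __name__ == "__main__":
    for i, shard in enumerate(shards):
        print(f"shard {i} local IDF: {local_idf(shard, 'iphone'):.3f}")
    print(f"aggregate IDF:     {aggregate_idf(shards, 'iphone'):.3f}")
```

With per-shard statistics the typo shard gives "iphone" a much larger IDF than the later shard, which is exactly the effect in the news-article anecdote. With the summed statistics every shard scores the term with the same weight.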
