On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <wellnho...@aevum.de> wrote:
> On 21/11/2017 18:42, serkanmula...@gmail.com wrote:
>> 2- (same question but for multiple indexes and polysearcher) If I use
>> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
>> Or would they be calculated separately for each index?
>
> I don't know off top of my head. It's possible that indexes are searched
> separately and the results are simply merged by normalized score. I'd have
> to look at the code to answer the question, but maybe Marvin can chime in.

The scores will be consistent.

To calculate IDF for a term accurately across a composite corpus formed from
multiple indexes, you need to know two things:

1. The total number of documents in the corpus. (Doc_Max())
2. The total number of documents which contain the term.
   (Doc_Freq(field, term))

Both PolySearcher and ClusterSearcher calculate their doc_max on construction
by summing the doc_max totals of all subsearchers. Similarly, both calculate
Doc_Freq for a term by summing Doc_Freq responses for all subsearchers.

    https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
    https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
    https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
    https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348

This approach trades away some performance for the sake of accuracy,
particularly with Doc_Freq -- query normalization takes longer when you have
to wait for a lot of subsearchers to report Doc_Freq numbers for N terms.
However, the alternative is occasional bizarre search results.

The best anecdote I ever heard illustrating why it's important to calculate
aggregate IDF consistently was an application searching a multi-shard index
containing news articles split by year. If you searched for "iphone", it
would be a very common term after the first release of the Apple iPhone.
However, in the years prior to the Apple iPhone's release, if "iphone" existed
in a shard it was likely a typo, so it would be very rare **and thus heavily
weighted**. So the top hit for "iphone", without consistent IDF calculation,
would be a typo'd article.

(A performance improvement on this stratagem is to create a shared Doc_Freq
source. So long as it contains all the common terms across all shards, it
doesn't have to be updated often -- Doc_Freq values don't change very fast as
indexes are updated.)

Marvin Humphrey
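
P.S. To make the "iphone" anecdote concrete, here is a minimal sketch in Python of per-shard versus summed statistics. It is not Lucy code: the names (SHARDS, idf, per_shard_idf, aggregate_idf) are hypothetical, the shard numbers are invented, and the IDF formula is a generic textbook-style one assumed for illustration rather than taken from Lucy's source. The only thing it borrows from the description above is the idea of summing doc_max and Doc_Freq across all subsearchers before computing IDF.

```python
import math

# Hypothetical per-shard statistics for the term "iphone".
# All numbers are invented for illustration.
SHARDS = {
    "2005": {"doc_max": 10_000, "doc_freq": 1},      # a lone typo
    "2008": {"doc_max": 10_000, "doc_freq": 2_000},  # a common term
}


def idf(doc_max, doc_freq):
    """Illustrative IDF, 1 + ln(doc_max / (1 + doc_freq)).

    This formula is an assumption made for the sketch.
    """
    return 1.0 + math.log(doc_max / (1.0 + doc_freq))


def per_shard_idf(shards):
    """IDF computed separately from each shard's own statistics."""
    return {name: idf(s["doc_max"], s["doc_freq"]) for name, s in shards.items()}


def aggregate_idf(shards):
    """IDF computed once, from statistics summed over all shards."""
    doc_max = sum(s["doc_max"] for s in shards.values())
    doc_freq = sum(s["doc_freq"] for s in shards.values())
    return idf(doc_max, doc_freq)


local = per_shard_idf(SHARDS)
shared = aggregate_idf(SHARDS)

for name, value in sorted(local.items()):
    print(f"shard {name}: local IDF = {value:.2f}")
print(f"all shards: shared IDF = {shared:.2f}")
```

With separate statistics, the 2005 shard gives "iphone" a much larger weight than the 2008 shard does, so the typo'd article can outrank everything else once results are merged. With summed statistics, every shard scores the term with the same weight.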