On 2010-10-25 11:22, Toke Eskildsen wrote:
> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote:
>> But it shows a problem of distributed search without common idf.
>> A doc will get different score in different shard.
>
> Bingo.
>
> I really don't understand why this fundamental problem with sharding
> isn't mentioned more often. Every time the advice "use sharding" is
> given, it should be followed with a "but be aware that it will make
> relevance ranking unreliable".

The reason is twofold, I think:

* There is an exact solution to this problem, namely to make two distributed calls instead of one (the first call collects per-shard IDFs for the given query terms, the second submits a query rewritten with the global IDFs). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. However, it means that every query now needs two calls instead of one, which potentially doubles the time to return results (for simple, common queries; for rare, complex queries the time will still be dominated by the query runtime on the shard servers).

* Another reason is that in many cases the difference between using the exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogeneous (e.g. you assign documents to shards by hash(docId)), then term distributions will also be similar.

So the question is whether you can accept an N% variance in scores across shards, or whether you want to bear the cost of an additional distributed RPC for every query...

To summarize, I would qualify your statement with: "...if the composition of your shards is drastically different". Otherwise the cost of using global IDF is not worth it, IMHO.

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
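
P.S. To make the two-call idea concrete, here is a minimal sketch in Python. It is not the SOLR-1632 code; the function names (shard_stats, global_idf, local_idf) and the toy documents are hypothetical, and the IDF formula 1 + ln(numDocs / (docFreq + 1)) is assumed here as the classic Lucene-style definition. The first phase gathers document counts and document frequencies from each shard, the merge step turns them into one global IDF per term, and the per-shard IDFs are computed alongside for comparison.

```python
import math
from collections import Counter


def shard_stats(docs, terms):
    """Phase 1, run on each shard: count docs and per-term document frequency."""
    df = Counter()
    for doc in docs:
        present = set(doc.split())
        for term in terms:
            if term in present:
                df[term] += 1
    return len(docs), df


def idf(num_docs, doc_freq):
    # Assumed formula (classic Lucene-style): 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def global_idf(shards, terms):
    """Merge the per-shard statistics into a single IDF per query term."""
    total_docs = 0
    total_df = Counter()
    for docs in shards:
        num_docs, df = shard_stats(docs, terms)
        total_docs += num_docs
        total_df.update(df)
    return {term: idf(total_docs, total_df[term]) for term in terms}


def local_idf(docs, terms):
    """What a shard would use on its own, without the extra round trip."""
    num_docs, df = shard_stats(docs, terms)
    return {term: idf(num_docs, df[term]) for term in terms}


# Two deliberately skewed toy shards (hypothetical data).
shards = [
    ["lucene search index", "lucene query parser", "lucene scoring idf"],
    ["cooking pasta recipe", "garden soil compost", "lucene in the kitchen"],
]
terms = ["lucene"]

g = global_idf(shards, terms)
per_shard = [local_idf(docs, terms) for docs in shards]

for i, local in enumerate(per_shard):
    print(f"shard {i}: local idf = {local['lucene']:.4f}")
print(f"global idf = {g['lucene']:.4f}")
```

In the second phase each shard would score the query with the values in g rather than its own local IDF. With skewed shards like these the local values land on either side of the global one; with hash-distributed documents the per-shard document frequencies would be close to proportional, so the gap should shrink.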