On 2010-10-25 11:22, Toke Eskildsen wrote:
> On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: 
>> But it shows a problem of distributed search without common idf.
>> A doc will get a different score in different shards.
> 
> Bingo.
> 
> I really don't understand why this fundamental problem with sharding
> isn't mentioned more often. Every time the advice "use sharding" is
> given, it should be followed with a "but be aware that it will make
> relevance ranking unreliable".

The reason is twofold, I think:

* there is an exact solution to this problem, namely to make two
distributed calls instead of one (a first call to collect per-shard IDFs
for the given query terms, and a second call to submit the query
rewritten with the global IDFs). This solution is implemented in
SOLR-1632, with some caching to reduce the cost for common queries.
However, it means that every query now needs two calls instead of one,
which potentially doubles the time to return results (for simple common
queries; for rare complex queries the time will still be dominated by
the query runtime on the shard servers).

* another reason is that in many cases the difference between using the
exact global IDF and per-shard IDFs is not that significant. If shards
are more or less homogeneous (e.g. you assign documents to shards by
hash(docId)) then term distributions will also be similar. So the
question is whether you can accept an N% variance in scores across
shards, or whether you want to bear the cost of an additional
distributed RPC for every query...
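
To make both points concrete, here is a minimal sketch in Python. It is
not the SOLR-1632 code: the function names, the toy corpus and the
two-shard layout are all hypothetical, and the IDF formula is the
classic Lucene-style 1 + ln(numDocs / (docFreq + 1)), used here as an
assumption. The first call is modelled as collecting per-shard counts
(document frequency and document count, the inputs to IDF), which the
coordinator sums into one global IDF.

```python
import hashlib
import math


def idf(doc_freq, num_docs):
    # Classic Lucene-style IDF (an assumption; other similarities differ).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def shard_stats(shard, term):
    # Phase 1, run on each shard: (docFreq, numDocs) for one term.
    return sum(1 for doc in shard if term in doc), len(shard)


def global_idf(shards, term):
    # Coordinator: sum the per-shard counts, then compute one IDF that
    # phase 2 would send back to every shard with the rewritten query.
    stats = [shard_stats(shard, term) for shard in shards]
    return idf(sum(df for df, _ in stats), sum(n for _, n in stats))


def local_idfs(shards, term):
    # What each shard would use on its own, without the extra round trip.
    return [idf(*shard_stats(shard, term)) for shard in shards]


def stable_hash(doc_id):
    # Deterministic stand-in for hash(docId) (hypothetical helper).
    digest = hashlib.md5(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


# Toy corpus: 1000 docs, only the first 100 contain the term "solr".
corpus = [{"lucene", "solr"} if i < 100 else {"lucene"} for i in range(1000)]

# Homogeneous shards: documents assigned by hash of the doc id.
hashed = [[], []]
for doc_id, doc in enumerate(corpus):
    hashed[stable_hash(doc_id) % 2].append(doc)

# Skewed shards: all "solr" docs end up on the first shard.
skewed = [corpus[:500], corpus[500:]]

for name, shards in (("hashed", hashed), ("skewed", skewed)):
    print(name, "local:", local_idfs(shards, "solr"),
          "global:", global_idf(shards, "solr"))
```

With hash-based assignment the two local IDFs should land close to each
other and to the global value; with the skewed split they diverge
widely, which is the case where the extra round trip pays off.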

To summarize, I would qualify your statement with: "...if the
composition of your shards is drastically different". Otherwise the cost
of using global IDF is not worth it, IMHO.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
