Re: Merging search results from different indexes

Andrzej Bialecki Fri, 14 May 2010 00:57:24 -0700

On 2010-05-14 03:30, Hemanth Yamijala wrote:
> Andrzej,
> 
>>> I have a situation where we have data indexed from two different
>>> sources into different indexes. The nature of data indexed is roughly
>>> the same. For example, assume that they are from crawls of two websites of
>>> book sellers. When a user fires a query, I'd like to search both
>>> indexes and match the results. That is, I'd like to point out in the
>>> results that something like Book A from Index 1 is the same as Book B
>>> from Index 2. Is there some way of doing this with Nutch or any
>>> related projects like Solr, if required by implementing custom plugins?
>>
>> If you want to implement this in Nutch searcher, then you would have to
>> modify the DistributedSearchBean where results coming from sub-searchers
>> are merged. In Solr this happens in SearchHandler.
>>
> 
> Thank you. I will take a look at these classes.
> 
>> The main question however is how "deep" that matching needs to go - if
>> you have 10000 hits from A and 10000 hits from B, and then present only
>> top 10, do you want to tell the user that hit #9999 from B matches hit
>> #1 from A?
>>
> 
> In the current scenario, I think it is unlikely that hits below a
> certain shallow depth are going to match. But that's just my guess
> right now. Can you please tell me how the depth impacts the solution?
> Are you thinking about likely performance issues?


Yes. A naive approach would be to try to find matching documents by
querying by the matching "tag" (or whatever makes them equivalent), but
of course this is prohibitively expensive. The next best option is to
retrieve the top N from both indexes, and find matching docs within
these two top-N sets. If your N=10, then a good match from index B that
happened to appear in 11th position won't be detected, and this may be
visible if most of your users view the top 10 results. But if you set
N=20 you should still find the majority of matching documents (assuming
their scoring is not too far off), at a reasonable overhead, and then
you can show the top 10 and discard the rest.
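
To make the top-N idea concrete, here is a minimal sketch in plain
Python. It is not Nutch or Solr code. It assumes each hit is a dict with
a numeric "score" and a field that identifies equivalent documents; the
field name "isbn" and the function name match_top_n are made up for
illustration. It also assumes scores from the two indexes are roughly
comparable, as noted above.

```python
from typing import Dict, List, Optional, Tuple

Hit = Dict[str, object]
Pair = Tuple[Hit, Optional[Hit]]


def match_top_n(hits_a: List[Hit], hits_b: List[Hit],
                n: int = 20, show: int = 10,
                key: str = "isbn") -> List[Pair]:
    """Pair up equivalent hits found within the top n of two ranked lists.

    hits_a and hits_b must already be sorted best-first. The `key` field
    (hypothetical) is whatever makes two documents "the same".
    """
    top_a = hits_a[:n]
    top_b = hits_b[:n]

    # Map each match key in B's top n to the position of its best-ranked hit.
    pos_in_b: Dict[object, int] = {}
    for i, hit in enumerate(top_b):
        pos_in_b.setdefault(hit[key], i)

    merged: List[Pair] = []
    used_b = set()
    for hit in top_a:
        i = pos_in_b.get(hit[key])
        if i is not None and i not in used_b:
            used_b.add(i)
            merged.append((hit, top_b[i]))   # matched pair
        else:
            merged.append((hit, None))       # no twin within B's top n

    # Hits from B that were not paired with anything from A.
    for i, hit in enumerate(top_b):
        if i not in used_b:
            merged.append((hit, None))

    # Rank each entry by the best score it contains (assumes comparable scores).
    def best_score(pair: Pair) -> float:
        return max(float(h["score"]) for h in pair if h is not None)

    merged.sort(key=best_score, reverse=True)
    return merged[:show]   # show the top entries, discard the rest
```

Calling match_top_n(hits_a, hits_b, n=10) would miss a twin sitting at
position 11 in index B, while n=20 picks it up at the cost of fetching
ten more hits from each index.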

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
