Andrzej,

>>>> I have a situation where we have data indexed from two different
>>>> sources into different indexes. The nature of data indexed is roughly
>>>> the same. For example, assume that they are from crawls of two websites of
>>>> book sellers. When a user fires a query, I'd like to search both
>>>> indexes and match the results. That is, I'd like to point out in the
>>>> results that something like Book A from Index 1 is the same as Book B
>>>> from Index 2. Is there some way of doing this with Nutch or any
>>>> related projects like Solr, implementing custom plugins if required?
>>>
>>> If you want to implement this in Nutch searcher, then you would have to
>>> modify the DistributedSearchBean where results coming from sub-searchers
>>> are merged. In Solr this happens in SearchHandler.
>>>
>>
>> Thank you. I will take a look at these classes.
>>
>>> The main question however is how "deep" that matching needs to go - if
>>> you have 10000 hits from A and 10000 hits from B, and then present only
>>> top 10, do you want to tell the user that hit #9999 from B matches hit
>>> #1 from A?
>>>
>>
>> In the current scenario, I think it is unlikely that hits below a
>> certain shallow depth are going to match. But that's just my guess
>> right now. Can you please tell me how the depth impacts the solution?
>> Are you thinking about likely performance issues?
>
> Yes. A naive approach would be to try to find matching documents by
> querying by the matching "tag" (or whatever makes them equivalent), but
> of course this is prohibitively expensive. The next best option is to
> retrieve the top N from both indexes, and find matching docs within
> these two top-N sets. If your N=10, then a good match from index B that
> happened to appear in 11th position won't be detected - and this may be
> visible if most of your users view the top 10 results. But if you set
> N=20 you should still find the majority of matching documents
> (assuming their scoring is not too far off), at a reasonable overhead,
> and then you can show the top 10 and discard the rest.
>

This makes a lot of sense. I understand your point now.
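
To check that I have understood, here is a rough sketch of the top-N
matching in Python. It is not Nutch or Solr code: the Hit type, the
match_key field (standing in for the "tag", e.g. an ISBN) and the
match_top_n function are all names I made up for illustration. It only
annotates the top hits from index A with a match from index B; how the
two result lists would be merged into one ranking is left out.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    doc_id: str
    match_key: str  # hypothetical equivalence "tag", e.g. an ISBN
    score: float


def match_top_n(
    hits_a: List[Hit], hits_b: List[Hit], n: int = 20, show: int = 10
) -> List[Tuple[Hit, Optional[Hit]]]:
    """Pair the top `show` hits from A with a matching hit among B's top `n`."""
    # Only the top n of each result list takes part in the matching.
    top_a = sorted(hits_a, key=lambda h: h.score, reverse=True)[:n]
    top_b = sorted(hits_b, key=lambda h: h.score, reverse=True)[:n]

    # Index B's top n by key; setdefault keeps the best-scoring hit per key.
    best_b_by_key = {}
    for hit in top_b:
        best_b_by_key.setdefault(hit.match_key, hit)

    # Show only the first `show` hits from A; the rest are discarded.
    return [(hit, best_b_by_key.get(hit.match_key)) for hit in top_a[:show]]
```

With n=10 a match sitting at position 11 in index B is missed, while
n=20 picks it up, at the cost of fetching twice as many hits from each
index.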

Thanks
Hemanth
