Andrzej,

>>>> I have a situation where we have data indexed from two different
>>>> sources into different indexes. The nature of the data indexed is
>>>> roughly the same. For example, assume that they are from crawls of
>>>> two websites of book sellers. When a user fires a query, I'd like to
>>>> search both indexes and match the results. That is, I'd like to point
>>>> out in the results that something like Book A from Index 1 is the
>>>> same as Book B from Index 2. Is there some way of doing this with
>>>> Nutch or any related projects like Solr, if required by implementing
>>>> custom plugins?
>>>
>>> If you want to implement this in Nutch searcher, then you would have to
>>> modify the DistributedSearchBean where results coming from sub-searchers
>>> are merged. In Solr this happens in SearchHandler.
>>>
>>
>> Thank you. I will take a look at these classes.
>>
>>> The main question however is how "deep" that matching needs to go - if
>>> you have 10000 hits from A and 10000 hits from B, and then present only
>>> top 10, do you want to tell the user that hit #9999 from B matches hit
>>> #1 from A?
>>>
>>
>> In the current scenario, I think it is unlikely that hits below a
>> certain shallow depth are going to match. But that's just my guess
>> right now. Can you please tell me how the depth impacts the solution?
>> Are you thinking about likely performance issues?
>
> Yes. A naive approach would be to try and find matching documents by
> querying by the matching "tag" (or whatever makes them equivalent), but
> of course this is prohibitively expensive. Next best option is to
> retrieve top-N from both indexes, and find matching docs within these
> two topN sets. If your N=10, then a good match from index B that
> happened to appear in 11th position won't be detected - and this may be
> visible if most of your users view only the top 10 results. But if you
> set N=20 you should still find the majority of matching documents
> (assuming their scoring is not too far off), at a reasonable overhead,
> and then you can show the top 10 and discard the rest.
>
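As a rough illustration of the top-N idea described above, here is a minimal Python sketch. It is not Nutch or Solr code: the `Hit` class, the `match_key` field (standing in for the "tag" that makes two documents equivalent, for example a normalized ISBN) and the function names are all hypothetical, and it assumes the scores from the two indexes are roughly comparable.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    index: str      # label of the source index, e.g. "A" or "B" (hypothetical)
    doc_id: str
    match_key: str  # hypothetical equivalence key, e.g. a normalized ISBN
    score: float


def first_by_key(hits):
    """Map each match_key to the best-ranked hit that carries it."""
    table = {}
    for h in hits:
        table.setdefault(h.match_key, h)
    return table


def merge_with_matches(hits_a, hits_b, n=20, show=10):
    """Pair hits within the two top-N sets, then return the top `show`.

    Both input lists must already be sorted by descending score.
    Each result is (hit, matching_hit_from_other_index_or_None).
    """
    top_a, top_b = hits_a[:n], hits_b[:n]
    key_to_a = first_by_key(top_a)
    key_to_b = first_by_key(top_b)

    # Matching is only attempted inside the two top-N sets, so a match
    # ranked below N in the other index is silently missed.
    paired = [(h, key_to_b.get(h.match_key)) for h in top_a]
    paired += [(h, key_to_a.get(h.match_key)) for h in top_b]

    # Assumption: scores from both indexes can be compared directly.
    paired.sort(key=lambda p: p[0].score, reverse=True)
    return paired[:show]
```

With n=10, a match sitting in 11th position in the other index goes undetected. Fetching n=20 and still showing only 10 picks it up, at the cost of retrieving and comparing twice as many hits.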
This makes a lot of sense. I understand your point now.

Thanks
Hemanth

