On 2010-05-14 03:30, Hemanth Yamijala wrote:
> Andrzej,
>
>>> I have a situation where we have data indexed from two different
>>> sources into different indexes. The nature of data indexed is roughly
>>> the same. For e.g. assume that they are from crawls of two websites of
>>> book sellers. When a user fires a query, I'd like to search both
>>> indexes and match the results. That is, I'd like to point out in the
>>> results that something like Book A from Index 1 is the same as Book B
>>> from Index 2. Is there some way of doing this with Nutch or any
>>> related projects like Solr, if required implementing custom plugins?
>>
>> If you want to implement this in Nutch searcher, then you would have to
>> modify the DistributedSearchBean where results coming from sub-searchers
>> are merged. In Solr this happens in SearchHandler.
>>
>
> Thank you. I will take a look at these classes.
>
>> The main question however is how "deep" that matching needs to go - if
>> you have 10000 hits from A and 10000 hits from B, and then present only
>> top 10, do you want to tell the user that hit #9999 from B matches hit
>> #1 from A?
>>
>
> In the current scenario, I think it is unlikely that hits below a
> certain shallow depth are going to match. But that's just my guess
> right now. Can you please tell me how the depth impacts the solution?
> Are you thinking about likely performance issues?

Yes. A naive approach would be to try to find matching documents by
querying on the matching "tag" (or whatever makes them equivalent), but
of course this is prohibitively expensive.

The next best option is to retrieve the top N from both indexes, and
find matching docs within these two top-N sets. If your N=10, then a
good match from index B that happened to appear in 11th position won't
be detected - and this may be visible if most of your users view the
top 10 results. But if you set N=20 you should still find the majority
of matching documents (assuming their scoring is not too far off), at a
reasonable overhead, and then you can show the top 10 and discard the
rest.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
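
A minimal sketch of the top-N matching idea described above, written in
Python for brevity. It is not Nutch or Solr code: the function name
merge_and_match, the "match_key" field (whatever makes two documents
equivalent, e.g. an ISBN) and the "score" field are all hypothetical,
and it assumes each hit list is already sorted by descending score and
that scores from the two indexes are roughly comparable.

```python
def merge_and_match(hits_a, hits_b, n=20, show=10):
    """Group the top-n hits of two indexes by a shared key.

    hits_a, hits_b: lists of dicts sorted by descending "score"; each
    dict carries a hypothetical "match_key" identifying equivalent docs.
    Returns up to `show` groups, each a dict mapping "A" and/or "B" to
    the best-ranked hit from that index.
    """
    groups = {}
    for source, hits in (("A", hits_a), ("B", hits_b)):
        for hit in hits[:n]:  # anything ranked below n is never examined
            group = groups.setdefault(hit["match_key"], {})
            # Keep only the best-ranked hit per index for each key.
            group.setdefault(source, hit)

    # Rank each group by the best score among its members (assumes
    # scores from the two indexes are on a comparable scale).
    ranked = sorted(
        groups.values(),
        key=lambda g: max(h["score"] for h in g.values()),
        reverse=True,
    )
    return ranked[:show]
```

A group that contains both "A" and "B" is a detected match. With n equal
to show, a counterpart sitting just below the cutoff in the other index
is missed; raising n (say, 20 for 10 displayed results) catches most of
those at the cost of fetching and grouping a few more hits per index.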

