On 23/06/2012 09:41, parnab kumar wrote:
Hi, I have crawled and indexed around 2.5 million web pages. However, almost 30% of the pages are near duplicates. Is there any functionality in Solr or Nutch to remove those near duplicates from the index? The Nutch dedup command only handles exact duplicates, I guess. Removing exact duplicates won't serve my purpose. Please help / advise me on how to address the problem.
From experience, the problem is that many businesses effectively have multiple copies of their websites on the web because they do not use 301 redirects. This means that example.com, example.net, example.org and example.cctld may all be the same site and differ only in the domain name. The solution involves identifying which of these clone sites is actually the main site and then excluding the clones from the indexing list. Sometimes you can use in-page cues such as URL construction or base href tags to identify the main site. However, the best way to solve the clones problem is to do this work outside the main/live index, so that the clones are dropped from the crawl list rather than cleaned out of the index afterwards.
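As a rough illustration of the approach described above, here is a minimal Python sketch, not anything built into Nutch or Solr. It assumes you already have the homepage HTML for each host in a dict; the function names (fingerprint, base_host, pick_main_sites) are made up for this example. It groups hosts whose visible text is identical, then uses the base href cue to guess the main site, falling back to the alphabetically first host when there is no cue. Real clone sites often differ slightly, so an exact hash of the text is only a starting point.

```python
import hashlib
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical helpers for illustration only; not part of Nutch or Solr.
TAG = re.compile(r"<[^>]+>")
BASE_HREF = re.compile(r'<base\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def fingerprint(html):
    """Hash the visible text only, so links that differ by domain are ignored."""
    text = TAG.sub(" ", html)
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def base_host(html):
    """Return the host named in a <base href="..."> tag, or None if absent."""
    match = BASE_HREF.search(html)
    return urlparse(match.group(1)).hostname if match else None


def pick_main_sites(homepages):
    """Map each presumed main host to the list of hosts that look like its clones.

    homepages: dict of host name -> homepage HTML (assumed to be fetched already).
    """
    groups = defaultdict(list)
    for host, html in homepages.items():
        groups[fingerprint(html)].append(host)

    result = {}
    for hosts in groups.values():
        hosts.sort()
        # Count base href "votes" that point at a host inside this group.
        votes = Counter(
            h for h in (base_host(homepages[host]) for host in hosts) if h in hosts
        )
        # Assumption: with no usable cue, the alphabetically first host wins.
        main = votes.most_common(1)[0][0] if votes else hosts[0]
        result[main] = [h for h in hosts if h != main]
    return result
```

The clone lists returned by pick_main_sites could then be fed into whatever exclusion list the crawler uses, so that only the main site of each group is fetched and indexed.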
Regards...jmcc
-- 
**********************************************************
John McCormac  * e-mail: [email protected]
MC2            * web: http://www.hosterstats.com/
22 Viewmount   * Domain Registrations Statistics
Waterford      * And Historical DNS Database.
Ireland        * Over 275 Million Domains Tracked.
IE             * http://www.hosterstats.com/blog
**********************************************************

