On 23/06/2012 09:41, parnab kumar wrote:
Hi,

I have crawled and indexed around 2.5 million web pages. However,
almost 30% of the pages are near duplicates. Is there any functionality
in Solr or Nutch to remove those near duplicates from the index? The Nutch
dedup command only handles exact duplicates, I guess. Exact duplicates won't
serve my purpose.
      Please help / advise me on how to address the problem.

From experience, the problem is that many businesses effectively have multiple copies of their websites on the web because they do not use 301 redirects. This means that example.com, example.net, example.org and example.cctld may all be the same site, differing only in the domain name. The solution involves identifying which of these clone sites is actually the main site and then excluding the clones from the indexing list. Sometimes you can use in-page cues such as URL construction or base href tags to identify the main site. However, the best place to solve the clones problem is outside the main/live index.
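
A minimal Python sketch of that idea, run as a step outside the index. It assumes you already have the home page HTML for each candidate domain in a dict. The names `fingerprint`, `base_host` and `find_clones` are hypothetical helpers made up for illustration, not part of Nutch or Solr. It only catches clones that differ by domain name, not near duplicates in general, and falling back to the alphabetically first domain when no base href points at a main site is an arbitrary assumption.

```python
import hashlib
import re
from collections import defaultdict
from urllib.parse import urlparse

BASE_TAG = re.compile(r"<base\b[^>]*>", re.IGNORECASE)
BASE_HREF = re.compile(r'<base\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def fingerprint(html, domain):
    """Hash a page after removing the parts that legitimately differ between clones."""
    text = BASE_TAG.sub("", html).lower()             # the base tag is only used as a cue
    text = text.replace(domain.lower(), "{domain}")   # mask the site's own domain name
    text = re.sub(r"\s+", " ", text).strip()          # collapse whitespace
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def base_host(html):
    """Return the host named in the page's base href, or None if there is none."""
    match = BASE_HREF.search(html)
    return urlparse(match.group(1)).hostname if match else None


def find_clones(pages):
    """Map each presumed main domain to the list of clone domains to exclude.

    pages: dict of domain -> home page HTML (assumed to be fetched already).
    """
    groups = defaultdict(list)
    for domain, html in pages.items():
        groups[fingerprint(html, domain)].append(domain)

    clones = {}
    for domains in groups.values():
        if len(domains) < 2:
            continue  # no clone found for this site
        # Each page "votes" for the domain its base href points at.
        votes = defaultdict(int)
        for domain in domains:
            host = base_host(pages[domain])
            if host in domains:
                votes[host] += 1
        # Most votes wins; ties (or no votes) go to the alphabetically first domain.
        main = max(sorted(domains), key=lambda d: votes[d])
        clones[main] = sorted(d for d in domains if d != main)
    return clones
```

The domains listed as values in the result are the ones you would drop from the indexing list, keeping only the key as the main site.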

Regards...jmcc
--
**********************************************************
John McCormac  *  e-mail: [email protected]
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  And Historical DNS Database.
Ireland        *  Over 275 Million Domains Tracked.
IE             *  http://www.hosterstats.com/blog
**********************************************************

