Re: Near Duplicate Detection in nutch /Solr

John McCormac Sat, 23 Jun 2012 04:08:38 -0700

On 23/06/2012 09:41, parnab kumar wrote:

Hi,


I have crawled and indexed around 2.5 million web pages. However,
almost 30% of the pages are near duplicates. Is there any functionality
in Solr or Nutch to remove those near duplicates from the index? The Nutch
dedup command only handles exact duplicates, I guess. Removing exact
duplicates alone won't serve my purpose.
      Please help / advise me on how to address the problem.

From experience, the problem is that many businesses effectively have
multiple copies of their websites on the web because they do not use 301
redirects. This means that example.com, example.net, example.org and
example.cctld may all be the same site but only differ in the domain
name. The solution involves identifying which of these clone sites is
actually the main site and then excluding the clones from the indexing
list. Sometimes you can use in-page cues such as URL construction or
Base href tags to identify the main site. However, the best way to solve
the clones problem is outside the main/live index.
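
As a rough illustration of the Base href cue, here is a minimal Python sketch that runs outside the index. It is not a Nutch or Solr feature. The function names, the (host, html) input format and the "www." stripping are all assumptions made for the example. It groups crawled hosts by the host their pages declare in a base tag and lists the clones to drop from the indexing list.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlsplit


class BaseHrefParser(HTMLParser):
    """Records the href of the first <base> tag seen in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def normalize_host(host):
    """Lowercase and drop a leading 'www.' (an assumption, not a rule)."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def declared_host(html):
    """Return the host named by the page's <base href>, or None."""
    parser = BaseHrefParser()
    parser.feed(html)
    if not parser.base_href:
        return None
    host = urlsplit(parser.base_href).hostname
    return normalize_host(host) if host else None


def group_clones(pages):
    """Map each declared main host to the set of crawled hosts pointing at it.

    `pages` is an iterable of (crawled_host, html) pairs (hypothetical input).
    Hosts whose pages carry no usable base href are left out.
    """
    groups = defaultdict(set)
    for crawled_host, html in pages:
        main = declared_host(html)
        if main:
            groups[main].add(normalize_host(crawled_host))
    return dict(groups)


def exclusion_list(groups):
    """Hosts to drop from the indexing list: every clone except the main host."""
    return sorted(h for main, hosts in groups.items() for h in hosts if h != main)
```

Feeding it one sample page per host and passing the result to exclusion_list() gives a candidate list of clone hosts. Pages without a base tag need one of the other cues, such as URL construction, so treat the output as a starting point and check it before excluding anything.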


Regards...jmcc
--
**********************************************************
John McCormac  *  e-mail: [email protected]
MC2            *  web: http://www.hosterstats.com/
22 Viewmount   *  Domain Registrations Statistics
Waterford      *  And Historical DNS Database.
Ireland        *  Over 275 Million Domains Tracked.
IE             *  http://www.hosterstats.com/blog
**********************************************************
