Nutch now has a HostURLNormalizer capable of normalizing source hosts to a
target host. This prevents duplication of complete websites and bad hyperlinks.

https://issues.apache.org/jira/browse/NUTCH-1319
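
To illustrate the idea of mapping source hosts to a target host, here is a
minimal Python sketch. It is not the Nutch implementation and does not use the
plugin's configuration format; the HOST_MAP table and the normalize_host
function are hypothetical names made up for this example, and the example.*
hosts are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping of source (clone) hosts to the target host.
# Illustrative only: this is NOT the HostURLNormalizer configuration format.
HOST_MAP = {
    "example.net": "example.com",
    "example.org": "example.com",
    "www.example.com": "example.com",
}


def normalize_host(url, host_map=HOST_MAP):
    """Return url with its host rewritten to the mapped target host, if any."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    target = host_map.get(host)
    if target is None:
        # Unknown host: leave the URL untouched.
        return url
    # Keep an explicit port if present; userinfo is dropped in this sketch.
    netloc = target if parts.port is None else f"{target}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    for u in ("http://example.net/a?b=1", "http://other.com/x"):
        print(u, "->", normalize_host(u))
```

Applied before URLs enter the crawl database, a mapping like this makes all
clone hosts collapse onto one host, so the same page is not fetched and
indexed once per host name.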
 
-----Original message-----
> From: John McCormac <[email protected]>
> Sent: Sat 23-Jun-2012 13:08
> To: [email protected]
> Subject: Re: Near Duplicate Detection in nutch /Solr
> 
> On 23/06/2012 09:41, parnab kumar wrote:
> > Hi,
> >
> > I have crawled and indexed around 2.5 million web pages. However,
> > almost 30% of the pages are near duplicates. Is there any functionality
> > in Solr or Nutch to remove those near duplicates from the index? The Nutch
> > dedup command only handles exact duplicates, I guess. Removing only exact
> > duplicates won't serve my purpose.
> > Please help / advise me on how to address the problem.
> 
> From experience, the problem is that many businesses effectively have
> multiple copies of their websites on the web because they do not use 301 
> redirects. This means that example.com, example.net, example.org and 
> example.cctld may all be the same site but only differ in the domain 
> name. The solution involves identifying which of these clone sites is 
> actually the main site and then excluding the clones from the indexing 
> list. Sometimes you can use in-page cues such as URL construction or
> base href tags to identify the main site. However, the best way to solve
> the clone problem is outside the main/live index.
> 
> Regards...jmcc
> -- 
> **********************************************************
> John McCormac  *  e-mail: [email protected]
> MC2            *  web: http://www.hosterstats.com/
> 22 Viewmount   *  Domain Registrations Statistics
> Waterford      *  And Historical DNS Database.
> Ireland        *  Over 275 Million Domains Tracked.
> IE             *  http://www.hosterstats.com/blog
> **********************************************************
> 
> 
> 
