Nutch now has a HostURLNormalizer that can normalize source hosts to a target host. This keeps complete websites from being duplicated under several host names and avoids bad hyperlinks.
https://issues.apache.org/jira/browse/NUTCH-1319

-----Original message-----
> From: John McCormac <[email protected]>
> Sent: Sat 23-Jun-2012 13:08
> To: [email protected]
> Subject: Re: Near Duplicate Detection in nutch /Solr
>
> On 23/06/2012 09:41, parnab kumar wrote:
> > Hi,
> >
> > I have crawled and indexed around 2.5 million web pages . However ,
> > almost 30 % of the pages are near duplicates . Is there any functionality
> > in SOLR or nutch to remove those near duplicates from the index. Nutch
> > dedup command only handles exact duplicates i guess . Exact duplicates wont
> > serve my purpose .
> > Please help / advise me on how to address the problem.
>
> From experience, the problem is that many businesses effectively have
> multiple copies of their websites on the web because they do not use 301
> redirects. This means that example.com, example.net, example.org and
> example.cctld may all be the same site but only differ in the domain
> name. The solution involves identifying which of these clone sites is
> actually the main site and then excluding the clones from the indexing
> list. Sometimes you can use in-page cues such as URL construction or
> Base href tags to identify the main site. However the best way to solve
> the clones problem is outside the main/live index.
>
> Regards...jmcc
> --
> **********************************************************
> John McCormac  *  e-mail: [email protected]
> MC2            *  web: http://www.hosterstats.com/
> 22 Viewmount   *  Domain Registrations Statistics
> Waterford      *  And Historical DNS Database.
> Ireland        *  Over 275 Million Domains Tracked.
> IE             *  http://www.hosterstats.com/blog
> **********************************************************
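
To illustrate the idea of mapping clone hosts onto one main host, here is a rough Python sketch. It is not the Nutch plugin's code or its configuration format; the HOST_MAP table and the names normalize_host and dedupe are made up for this example, and it assumes you have already decided which host is the main site.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table: clone host -> host chosen as the main site.
HOST_MAP = {
    "example.net": "example.com",
    "example.org": "example.com",
    "www.example.com": "example.com",
}


def normalize_host(url, host_map=HOST_MAP):
    """Rewrite the host of url to its target host, if one is configured."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    target = host_map.get(host)
    if target is None:
        return url  # unknown host: leave the URL untouched

    # Keep any userinfo and port, swap only the host name.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{target}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


def dedupe(urls):
    """Normalize hosts, then drop URLs that collapse to the same address."""
    seen = set()
    unique = []
    for url in urls:
        normalized = normalize_host(url)
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique
```

With a table like this, http://example.net/about and http://example.org/about both become http://example.com/about, so only one copy reaches the index. Note that this only collapses whole-site clones; it does nothing for near-duplicate pages within a single host.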

