Keep an eye on these open issues:

https://issues.apache.org/jira/browse/NUTCH-1324
https://issues.apache.org/jira/browse/NUTCH-1325
https://issues.apache.org/jira/browse/NUTCH-1326

They are a set of tools capable of deduplicating the various databases via the
HostNormalizer. They collect information on hosts, most importantly the link
score. They also collect information on duplicates within a host and then
produce deduplication rules for the HostNormalizer based on the host and
duplicate information.

It's limited to domain because that's a larger problem in terms of resources 
and a bit easier to deal with. 
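
To make the rule format quoted below more concrete, here is a rough sketch of
how a "pattern target" rule with a leading wildcard could be applied to a URL.
This is not the HostNormalizer code from the issues above; the function names
(parse_rules, map_host, normalize_url) are made up for illustration, and it
assumes that "*.example.com" also covers the bare example.com host, which may
not match what the real tool does.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_rules(text):
    """Parse 'pattern target' lines into (pattern, target) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, target = line.split()
        rules.append((pattern.lower(), target.lower()))
    return rules


def map_host(host, rules):
    """Return the target host of the first matching rule, else the host."""
    host = host.lower()
    for pattern, target in rules:
        if pattern.startswith("*."):
            suffix = pattern[2:]
            # Assumption: the wildcard matches the bare domain and any
            # subdomain of it.
            if host == suffix or host.endswith("." + suffix):
                return target
        elif host == pattern:
            return target
    return host


def normalize_url(url, rules):
    """Rewrite the host part of a URL according to the rules."""
    parts = urlsplit(url)
    if not parts.hostname:
        return url  # nothing to map
    new_host = map_host(parts.hostname, rules)
    # Keep an explicit port if present; any user:password part is dropped
    # in this sketch.
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


rules = parse_rules("*.example.com example.org")
mapped = normalize_url("http://www.example.com/a?b=1", rules)
untouched = normalize_url("http://example.net/page", rules)
```

With the single rule above, any URL on www.example.com (or another host under
example.com) ends up on example.org with its path and query intact, while URLs
on unrelated hosts pass through unchanged.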
 
-----Original message-----
> From: John McCormac <[email protected]>
> Sent: Sat 23-Jun-2012 15:11
> To: [email protected]
> Subject: Re: Near Duplicate Detection in nutch /Solr
> 
> On 23/06/2012 13:17, Markus Jelsma wrote:
> > Hello,
> >
> > It maps anything to anything and has wildcard support:
> > *.example.com example.org
> > maps all URLs on the example.com domain to example.org.
> >
> 
> Thanks.
> The main problem though is still identifying the clone/original sites so 
> that the mapping can be determined.
> 
> The process I use has the advantage of having the set of websites to be 
> indexed predetermined and the clone/original problem is dealt with (for 
> the most part) before the main indexing run. It can be a complicated 
> approach depending on the number of TLDs and target countries involved.
> 
> The logic behind this approach is preventing GIGO as it is easier and 
> more efficient to solve the clone problem before it takes cycles and
> bandwidth in the main index run.
> 
> What I have seen is that some businesses will use a number of keyword-type
> domains pointing (without a 301 redirect) to their main site.
> However the main clone pair is the ccTLD/.com version of a site (same 
> domain but different TLDs). The .net and .org may also exist for older 
> businesses. The non-core TLDs (biz/info/mobi/eu/asia etc) are often less 
> likely to be properly set up in DNS with a working website as about 85% 
> of a country's domain footprint will be concentrated on the ccTLD/.com axis.
> 
> Regards...jmcc
> -- 
> **********************************************************
> John McCormac  *  e-mail: [email protected]
> MC2            *  web: http://www.hosterstats.com/
> 22 Viewmount   *  Domain Registrations Statistics
> Waterford      *  And Historical DNS Database.
> Ireland        *  Over 275 Million Domains Tracked.
> IE             *  http://www.hosterstats.com/blog
> **********************************************************
> 
> 
> 
